Re: VOTE: moving commits to git-wp.o.a & github PR features.
+1 On May 16, 2014, at 2:02 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hi, I would like to initiate a procedural vote on moving to git as our primary commit system, and using github PRs as described in Jake Farrell's email to @dev [1] [1] https://blogs.apache.org/infra/entry/improved_integration_between_apache_and If voting succeeds, I will file a ticket with infra to commence necessary changes and to move our project to git-wp as primary source for commits as well as add github integration features [1]. (I assume pure git commits will be required after that's done, with no svn commits allowed). The motivation is to engage git and github PR features as described, and avoid git mirror history messes like we've seen associated with authors.txt file fluctuations. PMC and committers have binding votes, so please vote. Lazy consensus with minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time for weekend (i.e. Tuesday afternoon PST). Here is my +1 -d Grant Ingersoll | @gsingers http://www.lucidworks.com
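The voting rule in the message above (binding votes only, at least three +1s, a 96-hour window) can be sketched as a small tally. This is only an illustration: the function names are made up here, and treating a binding -1 as blocking is an assumption about what "lazy consensus" means for this vote, not something the email spells out.

```python
from datetime import datetime, timedelta

VOTE_WINDOW = timedelta(hours=96)   # "Vote will conclude in 96 hours"
MIN_BINDING_PLUS_ONES = 3           # "minimum 3 +1 votes"


def vote_closes_at(opened):
    """Return the moment the vote closes, 96 hours after it was opened."""
    return opened + VOTE_WINDOW


def vote_passes(votes):
    """votes: iterable of (value, is_binding) pairs, value in {+1, 0, -1}.

    Only binding votes (PMC members and committers) are counted.
    Assumption: any binding -1 blocks the vote.
    """
    binding = [value for value, is_binding in votes if is_binding]
    plus_ones = sum(1 for value in binding if value == 1)
    minus_ones = sum(1 for value in binding if value == -1)
    return plus_ones >= MIN_BINDING_PLUS_ONES and minus_ones == 0
```

With the timestamp quoted in the message, `vote_closes_at(datetime(2014, 5, 16, 14, 2))` lands on Tuesday, May 20, which matches the "Tuesday afternoon" mentioned in the email.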
Re: [MAHOUT-EXAMPLES Jenkins] Mahout-Examples-Cluster-Reuters-II - Build # 831 - Still Failing
I gave a few more people access: Frank, you and Andrew. Happy to add others. -Grant On May 2, 2014, at 2:29 PM, Sebastian Schelter s...@apache.org wrote: Do we have access now to fix the build? This becomes really annoying, we only have to change a few lines in the jenkins config... On 05/02/2014 08:24 PM, Apache Jenkins Server wrote: The Apache Jenkins build system has built Mahout-Examples-Cluster-Reuters-II (build #831) Status: Still Failing Check console output at https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/831/ to view the results.
Re: Board Report
On Apr 7, 2014, at 10:29 AM, Pat Ferrel p...@occamsmachete.com wrote: The document does not mention the state of the existing Spark work in the snapshot codebase. Shouldn’t this be noted? It's under the community section. On Apr 7, 2014, at 5:06 AM, Sebastian Schelter s...@apache.org wrote: I think we should mention the redesign/rework of the website and the completion of the move from the old wiki to Apache CMS. --sebastian On 04/07/2014 02:04 PM, Grant Ingersoll wrote: Here is my proposed report. For the most part, I think the only right thing to do vis-a-vis the Board is to report that we are in the midst of a healthy (yes, I believe it is, for the most part healthy and normal) discussion on where to go next. PMC Members: this is checked into SVN at https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt. It is due on Wednesday. If you object to this approach of reporting, please let me know ASAP and suggest alternatives. === Apache Mahout Status Report: April 2014 === - Apache Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining Project Status -- The project continues to have a large and active user base. While the developer base has continued to grow, there is a very active and healthy debate going on about where Mahout goes next. Please see the Issues section below for more details. Community - * Andrew Musselman was voted in as new committer. * No changes to the PMC in the reporting period. * The main issue concerning the community right now is the addition of new contributions from 0xData and the integration of Mahout with Spark. Community Objectives Our goal is to build scalable machine learning libraries. See the Issues section below for the debate in the community about our objectives. 
Releases In addition to an ongoing debate on Mahout's future, the community is actively working on integrating Mahout with Scala/Spark, updating documentation, and bringing in new code and committers to update the core project. Issues -- The Mahout community is at a crossroads in terms of where to go next. While the project has a broad number of users and interested parties, most committers are trying to maintain the code base on a purely part time basis, when the amount of work to sustain these users clearly points to it needing to be full time. Furthermore, much of our original code base is written for Hadoop MapReduce 1.0, which many in the community have come to realize is not well-suited for solving the kinds of problems that Mahout has set out to solve. There have been several lengthy discussions and prototypes going on to work out next directions along the lines of the Spark and 0xData contributions (there are numerous threads on the dev@mahout.a.o mailing list.) The PMC does not think this requires Board intervention at this time as the debate is, as far as we can tell, healthy. We do, however, expect that this debate will take some time to resolve and may mean we won't be shipping a 1.0 release any time soon. We will keep the Board apprised of our next steps as we work through the process. On Apr 7, 2014, at 4:53 AM, Grant Ingersoll gsing...@apache.org wrote: To Sean's point, if Mahout were my company, I would do the following, albeit pragmatic and not so pleasant thing, assuming, of course, I had the $$$ to do so: 1. Clean up existing code with a laser focus on a few key areas (Sebastian's list makes sense) using a part of the team and call it 1.0 and ship it, as it has a number of users and they deserve to not have the rug pulled out from under them. 2. Spin out a subset of the team to explore and prototype 2.0 based on two very positive and re-energizing looking ideas: a. Scala DSL (and maybe Spark) b. 
0xData All of the work for #2 would be done in a clean repo and would only bring in legacy code where it was truly beneficial (back compat. can come later, if at all). It would then benchmark those two approaches as well as look at where they overlap and are mutually beneficial and then go forward with the winner. 3. Once #2 is viable, put most effort into it and maintain 1.0 with as minimal support as possible, encouraging, neh -- actively helping -- 1.0 customers upgrade as quickly as possible. The tricky part then becomes how do you make sure to still make your sales #'s while also convincing them that your roadmap is what they are really buying. If I didn't have the $$$ to do both of these (i.e. we need a massive turn around and we have one last shot), I would be all in on #2. --- That being said, Mahout is not my company. Heck, Mahout is not even
Re: Board Report
On Apr 7, 2014, at 11:03 AM, Pat Ferrel p...@occamsmachete.com wrote: Mahout needs a reboot. Grant has the right perspective, but I’d take it further. His #2 (two efforts) is not and never would be reasonable in anything but a huge company. FWIW, that was my view _if_ I were in a company funding it. Further down, my take is that for the most part we should follow the natural Apache way and let those who do the work make the choices, which AFAICT, point at forgetting about #1 and pursuing #2 only. -Grant I have never and would never take a team the size of Mahout (even with some new committers) and split a reboot into two parts on two engines. No sane project manager would allow this. Why do we think it will work here? The recent Gigaom article left me sympathetic with how confused the readers must be, let alone potential users or contributors. Sean is not being nihilistic; two directions will not work for Mahout. Mahout has a bad reputation already for being a poorly documented and poorly integrated loose collection of code with a lot of technical debt. Honestly has anyone reading this seen increasing interest in the project? A reboot is the only thing I can imagine to re-energize it and even that must be done with the utmost in clear communication. If you accept the above then there seem to be some ways forward: 1) reboot on Spark, let 0xdata do what they will. 2) reboot on 0xdata and let the Spark committers consider becoming MLlib committers or other. 3) fail by issuing confusing direction statements, spending too much time supporting and reconciling multiple significantly disparate efforts and dividing committers. This is such a classic fail that I have a hard time even considering it. I’d like to see #1 for what it’s worth. A concerted effort by all on #1 would ensure Mahout is included in future distros. Maybe even #2 would be included but #3? It’s a non-starter.
On Apr 7, 2014, at 4:53 AM, Grant Ingersoll gsing...@apache.org wrote: To Sean's point, if Mahout were my company, I would do the following, albeit pragmatic and not so pleasant thing, assuming, of course, I had the $$$ to do so: 1. Clean up existing code with a laser focus on a few key areas (Sebastian's list makes sense) using a part of the team and call it 1.0 and ship it, as it has a number of users and they deserve to not have the rug pulled out from under them. 2. Spin out a subset of the team to explore and prototype 2.0 based on two very positive and re-energizing looking ideas: a. Scala DSL (and maybe Spark) b. 0xData All of the work for #2 would be done in a clean repo and would only bring in legacy code where it was truly beneficial (back compat. can come later, if at all). It would then benchmark those two approaches as well as look at where they overlap and are mutually beneficial and then go forward with the winner. 3. Once #2 is viable, put most effort into it and maintain 1.0 with as minimal support as possible, encouraging, neh -- actively helping -- 1.0 customers upgrade as quickly as possible. The tricky part then becomes how do you make sure to still make your sales #'s while also convincing them that your roadmap is what they are really buying. If I didn't have the $$$ to do both of these (i.e. we need a massive turn around and we have one last shot), I would be all in on #2. --- That being said, Mahout is not my company. Heck, Mahout is not even a company, so we don't need to be bound by company conventions and thought processes, even if that fits with all of our individual day jobs. And, thankfully, we don't have any sales numbers to make. We are chartered with one and only one mission: produce open source, scalable machine learning libraries under the Apache license and community driven principles. We are not required by the Board or anyone else to support version X for Y years or to use Hadoop or Scala or Java. 
We are also not required to implement any specific algorithms or deliver them on specific time frames. We are also not required to provide users upgrade paths or the like. Naturally, we _want_ to do these things for the sake of the community, but let's be clear: it is not a requirement from the ASF. We are, however, required, to have a sustaining community. I personally think we should start clean on #2, throwing off the shackles of the past and emerge 6-9 months later with Mahout 2.0 (and yes, call it that, not 0.1 as Sebastian suggests, for marketing reasons) built on a completely new and fresh repository, likely bringing in only the Math/collections underpinnings and maybe the build system. This new repository would have only a handful of core algorithms that we know are well implemented, sustainable and best in class. I think we
Re: Board Report
To Sean's point, if Mahout were my company, I would do the following, albeit pragmatic and not so pleasant thing, assuming, of course, I had the $$$ to do so: 1. Clean up existing code with a laser focus on a few key areas (Sebastian's list makes sense) using a part of the team and call it 1.0 and ship it, as it has a number of users and they deserve to not have the rug pulled out from under them. 2. Spin out a subset of the team to explore and prototype 2.0 based on two very positive and re-energizing looking ideas: a. Scala DSL (and maybe Spark) b. 0xData All of the work for #2 would be done in a clean repo and would only bring in legacy code where it was truly beneficial (back compat. can come later, if at all). It would then benchmark those two approaches as well as look at where they overlap and are mutually beneficial and then go forward with the winner. 3. Once #2 is viable, put most effort into it and maintain 1.0 with as minimal support as possible, encouraging, neh -- actively helping -- 1.0 customers upgrade as quickly as possible. The tricky part then becomes how do you make sure to still make your sales #'s while also convincing them that your roadmap is what they are really buying. If I didn't have the $$$ to do both of these (i.e. we need a massive turn around and we have one last shot), I would be all in on #2. --- That being said, Mahout is not my company. Heck, Mahout is not even a company, so we don't need to be bound by company conventions and thought processes, even if that fits with all of our individual day jobs. And, thankfully, we don't have any sales numbers to make. We are chartered with one and only one mission: produce open source, scalable machine learning libraries under the Apache license and community driven principles. We are not required by the Board or anyone else to support version X for Y years or to use Hadoop or Scala or Java. 
We are also not required to implement any specific algorithms or deliver them on specific time frames. We are also not required to provide users upgrade paths or the like. Naturally, we _want_ to do these things for the sake of the community, but let's be clear: it is not a requirement from the ASF. We are, however, required, to have a sustaining community. I personally think we should start clean on #2, throwing off the shackles of the past and emerge 6-9 months later with Mahout 2.0 (and yes, call it that, not 0.1 as Sebastian suggests, for marketing reasons) built on a completely new and fresh repository, likely bringing in only the Math/collections underpinnings and maybe the build system. This new repository would have only a handful of core algorithms that we know are well implemented, sustainable and best in class. I think we should look at the lead up to 0.9 as an experiment that proved out a lot of interesting ideas, including the fact that Mahout proved there is vast interest in open source large scale machine learning and that it is the benchmark for comparison. Not many other ML projects can say that, even if they have better technical implementations or are less fragmented. Once you realize something has outlived its usefulness in software, however, there is no point in lingering. That being said, at least for the foreseeable future, I am not in a position to contribute much code. So, from my perspective, the ASF Meritocratic approach takes over: those who do the work make the decisions. If you want something in, then put up the patch and ask for feedback. If no one provides feedback, assume lazy consensus and move forward. Nothing convinces people better than actual, real, executing code. For my part, I am happy to continue to work the bureaucratic side of things to make sure reports get filed, credentials get created, etc. and the occasional patch. I hope one day I will have time to contribute again.
I will follow up w/ a separate email on what I am going to put in the Board Report. On Apr 7, 2014, at 1:52 AM, Sean Owen sro...@gmail.com wrote: No, it's about the opposite. I'm referring to the default, current state of play here. The issues for a vendor are demand and supportability. Do people want to pay for support of X? Can you honestly say you have expertise to support and influence X over at least a major release cycle (12-18 months)? The latter needs a reasonably reliable roadmap and continuity. I'm suggesting that in the current state, demand is low and going down. The current code base seems de facto deprecated/unsupported already, and possibly to be removed or dramatically changed into something as-yet unclear. Nobody here seems to have taken a hard decision regarding a next major release, but, the trajectory of that decision seems clear if the current state remains the same. From my perspective,
Re: Board Report
Here is my proposed report. For the most part, I think the only right thing to do vis-a-vis the Board is to report that we are in the midst of a healthy (yes, I believe it is, for the most part healthy and normal) discussion on where to go next. PMC Members: this is checked into SVN at https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt. It is due on Wednesday. If you object to this approach of reporting, please let me know ASAP and suggest alternatives. === Apache Mahout Status Report: April 2014 === - Apache Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining Project Status -- The project continues to have a large and active user base. While the developer base has continued to grow, there is a very active and healthy debate going on about where Mahout goes next. Please see the Issues section below for more details. Community - * Andrew Musselman was voted in as new committer. * No changes to the PMC in the reporting period. * The main issue concerning the community right now is the addition of new contributions from 0xData and the integration of Mahout with Spark. Community Objectives Our goal is to build scalable machine learning libraries. See the Issues section below for the debate in the community about our objectives. Releases In addition to an ongoing debate on Mahout's future, the community is actively working on integrating Mahout with Scala/Spark, updating documentation, and bringing in new code and committers to update the core project. Issues -- The Mahout community is at a crossroads in terms of where to go next. While the project has a broad number of users and interested parties, most committers are trying to maintain the code base on a purely part time basis, when the amount of work to sustain these users clearly points to it needing to be full time. 
Furthermore, much of our original code base is written for Hadoop MapReduce 1.0, which many in the community have come to realize is not well-suited for solving the kinds of problems that Mahout has set out to solve. There have been several lengthy discussions and prototypes going on to work out next directions along the lines of the Spark and 0xData contributions (there are numerous threads on the dev@mahout.a.o mailing list.) The PMC does not think this requires Board intervention at this time as the debate is, as far as we can tell, healthy. We do, however, expect that this debate will take some time to resolve and may mean we won't be shipping a 1.0 release any time soon. We will keep the Board apprised of our next steps as we work through the process. On Apr 7, 2014, at 4:53 AM, Grant Ingersoll gsing...@apache.org wrote: To Sean's point, if Mahout were my company, I would do the following, albeit pragmatic and not so pleasant thing, assuming, of course, I had the $$$ to do so: 1. Clean up existing code with a laser focus on a few key areas (Sebastian's list makes sense) using a part of the team and call it 1.0 and ship it, as it has a number of users and they deserve to not have the rug pulled out from under them. 2. Spin out a subset of the team to explore and prototype 2.0 based on two very positive and re-energizing looking ideas: a. Scala DSL (and maybe Spark) b. 0xData All of the work for #2 would be done in a clean repo and would only bring in legacy code where it was truly beneficial (back compat. can come later, if at all). It would then benchmark those two approaches as well as look at where they overlap and are mutually beneficial and then go forward with the winner. 3. Once #2 is viable, put most effort into it and maintain 1.0 with as minimal support as possible, encouraging, neh -- actively helping -- 1.0 customers upgrade as quickly as possible. 
The tricky part then becomes how do you make sure to still make your sales #'s while also convincing them that your roadmap is what they are really buying. If I didn't have the $$$ to do both of these (i.e. we need a massive turn around and we have one last shot), I would be all in on #2. --- That being said, Mahout is not my company. Heck, Mahout is not even a company, so we don't need to be bound by company conventions and thought processes, even if that fits with all of our individual day jobs. And, thankfully, we don't have any sales numbers to make. We are chartered with one and only one mission: produce open source, scalable machine learning libraries under the Apache license and community driven principles. We are not required by the Board or anyone else to support version X for Y years or to use Hadoop or Scala or Java. We are also not required to implement any specific algorithms
Re: Board Report
Good point, please update the report (you should have credentials) -Grant On Apr 7, 2014, at 5:06 AM, Sebastian Schelter s...@apache.org wrote: I think we should mention the redesign/rework of the website and the completion of the move from the old wiki to Apache CMS. --sebastian On 04/07/2014 02:04 PM, Grant Ingersoll wrote: Here is my proposed report. For the most part, I think the only right thing to do vis-a-vis the Board is to report that we are in the midst of a healthy (yes, I believe it is, for the most part healthy and normal) discussion on where to go next. PMC Members: this is checked into SVN at https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt. It is due on Wednesday. If you object to this approach of reporting, please let me know ASAP and suggest alternatives. === Apache Mahout Status Report: April 2014 === - Apache Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining Project Status -- The project continues to have a large and active user base. While the developer base has continued to grow, there is a very active and healthy debate going on about where Mahout goes next. Please see the Issues section below for more details. Community - * Andrew Musselman was voted in as new committer. * No changes to the PMC in the reporting period. * The main issue concerning the community right now is the addition of new contributions from 0xData and the integration of Mahout with Spark. Community Objectives Our goal is to build scalable machine learning libraries. See the Issues section below for the debate in the community about our objectives. Releases In addition to an ongoing debate on Mahout's future, the community is actively working on integrating Mahout with Scala/Spark, updating documentation, and bringing in new code and committers to update the core project. 
Issues -- The Mahout community is at a crossroads in terms of where to go next. While the project has a broad number of users and interested parties, most committers are trying to maintain the code base on a purely part time basis, when the amount of work to sustain these users clearly points to it needing to be full time. Furthermore, much of our original code base is written for Hadoop MapReduce 1.0, which many in the community have come to realize is not well-suited for solving the kinds of problems that Mahout has set out to solve. There have been several lengthy discussions and prototypes going on to work out next directions along the lines of the Spark and 0xData contributions (there are numerous threads on the dev@mahout.a.o mailing list.) The PMC does not think this requires Board intervention at this time as the debate is, as far as we can tell, healthy. We do, however, expect that this debate will take some time to resolve and may mean we won't be shipping a 1.0 release any time soon. We will keep the Board apprised of our next steps as we work through the process. On Apr 7, 2014, at 4:53 AM, Grant Ingersoll gsing...@apache.org wrote: To Sean's point, if Mahout were my company, I would do the following, albeit pragmatic and not so pleasant thing, assuming, of course, I had the $$$ to do so: 1. Clean up existing code with a laser focus on a few key areas (Sebastian's list makes sense) using a part of the team and call it 1.0 and ship it, as it has a number of users and they deserve to not have the rug pulled out from under them. 2. Spin out a subset of the team to explore and prototype 2.0 based on two very positive and re-energizing looking ideas: a. Scala DSL (and maybe Spark) b. 0xData All of the work for #2 would be done in a clean repo and would only bring in legacy code where it was truly beneficial (back compat. can come later, if at all). 
It would then benchmark those two approaches as well as look at where they overlap and are mutually beneficial and then go forward with the winner. 3. Once #2 is viable, put most effort into it and maintain 1.0 with as minimal support as possible, encouraging, neh -- actively helping -- 1.0 customers upgrade as quickly as possible. The tricky part then becomes how do you make sure to still make your sales #'s while also convincing them that your roadmap is what they are really buying. If I didn't have the $$$ to do both of these (i.e. we need a massive turn around and we have one last shot), I would be all in on #2. --- That being said, Mahout is not my company. Heck, Mahout is not even a company, so we don't need to be bound by company conventions and thought processes, even if that fits with all of our individual day jobs
Re: Mail and IRC parsing
We've (LucidWorks) got full indexing and search of the Mahout mail archives at http://find.searchub.org. We could probably add in IRC pretty easily if you want. -Grant On Mar 22, 2014, at 2:06 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: I put up a parser for the IRC history logs here https://github.com/andrewmusselman/util/blob/master/irc-parser.sh I'd like to write one for the user list too to figure out the most common problems/questions so we can focus effort on repairs to bugs and docs. But the mail archives at https://mail-archives.apache.org/mod_mbox/mahout-user/ are dynamic, loaded in through JavaScript, so parsing them isn't that straightforward. Is it possible to get the mbox files directly? Grant Ingersoll | @gsingers http://www.lucidworks.com
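If raw mbox files can be obtained, finding the most common questions on the user list does not require scraping the JavaScript-driven archive pages. The sketch below uses Python's standard `mailbox` module to count messages per thread subject. The file path and the function names are hypothetical, and grouping by normalized subject line is only a rough stand-in for real thread detection.

```python
import mailbox
import re
from collections import Counter

# Strips leading reply/forward markers such as "Re:", "RE: Re:", "Fwd:".
_REPLY_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Collapse whitespace, drop reply prefixes, and lowercase the subject."""
    text = " ".join(str(subject or "").split())
    return _REPLY_PREFIX.sub("", text).strip().lower()


def top_threads(mbox_path, n=10):
    """Return the n most frequent normalized subjects in an mbox file."""
    counts = Counter()
    for message in mailbox.mbox(mbox_path):
        counts[normalize_subject(message["subject"])] += 1
    return counts.most_common(n)
```

Calling `top_threads("mahout-user-201403.mbox")` (a made-up file name) would return a list of `(subject, message_count)` pairs, most active thread first.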
Board Report
Can someone summarize the 0xData and the Spark work for me for the board report? I've unfortunately been too busy to keep up on the threads on it, but need to write the board report for this month. You can either summarize here or add it to the community section at https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt Also, assuming we are going ahead w/ the 0xData stuff, we likely need to do a software grant for that. Thanks, Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: 0xdata interested in contributing
in the framework to be created, managed and deleted. There is also an R binding for h2o which allows programs to access and manage h2o objects. Functions defined in an R-like language can be applied in parallel to data frames stored in the h2o framework. *Proposed Developer User Experience* I see several kinds of users. These include numerical developers (largely mathematicians), Java or Scala developers (like current Mahout devs), and data analysts. - Local h2o single-node cluster - Temporary h2o cluster - Shared h2o cluster All of these modes will be facilitated by the proposed development. *Complementarity with Other Platforms* I view h2o as complementary with Hadoop and Spark because it provides a solid in-memory execution engine as opposed to a general out-of-core computation model that other map-reduce engines like Hadoop and Spark implement or more general dataflow systems like Stratosphere, Tez or Drill. Also, h2o provides no persistence but depends on other systems for that such as NFS, HDFS, NAS or MapR. H2o is also nicely complementary to R in that R can invoke operations and move data to and from h2o very easily. *Required Additional Work* Sparse matrices Linear algebra bindings Class-file magic to allow off-the-cuff function definitions Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: MAHOUT 0.9 Release - New URL
+1 from me. On Jan 22, 2014, at 5:55 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Fixed the issues that were reported this week and restored FP mining into the codebase. Here's the URL for the final release in staging:- https://repository.apache.org/content/repositories/orgapachemahout-1003/org/apache/mahout/mahout-distribution/0.9/ The artifacts have been signed with the following key: https://people.apache.org/keys/committer/smarthi.asc a) Verify that u can unpack the release (tar or zip) b) Verify u r able to compile the distro c) Run through the unit tests: mvn clean test d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run through all the different options in each script. Committers and PMC, need a minimum of 3 '+1' votes for the release to be finalized. Grant Ingersoll | @gsingers http://www.lucidworks.com
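The verification checklist in the message above can be partly scripted. The following is a rough sketch, not the project's release tooling: the function names are made up, the `.asc` file is assumed to sit next to the downloaded archive, and `mvn clean install -DskipTests` is an assumed stand-in for "compile the distro" (the email only names `mvn clean test`). Step (d), running the example scripts, is left as a manual step.

```python
import subprocess
import tarfile
import zipfile
from pathlib import Path


def signature_check_command(archive):
    """Command to check the detached signature (assumes <archive>.asc exists)."""
    return ["gpg", "--verify", f"{archive}.asc", str(archive)]


def unpack_release(archive, dest):
    """Step (a): unpack a tar or zip release and return the destination path."""
    archive, dest = Path(archive), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
    else:
        raise ValueError(f"not a tar or zip archive: {archive}")
    return dest


def verification_commands(source_dir):
    """Steps (b) and (c) as (argv, working_directory) pairs."""
    cwd = str(source_dir)
    return [
        (["mvn", "clean", "install", "-DskipTests"], cwd),  # (b) assumed compile step
        (["mvn", "clean", "test"], cwd),                    # (c) as given in the email
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv, cwd in commands:
        subprocess.run(argv, cwd=cwd, check=True)
```

Keeping the commands as data rather than running them inline makes it easy to print them for review before a tester executes anything against the staged artifacts.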
Re: Mahout 0.9 Release - Call for Volunteers
$MAHOUT_HOME/examples/bin. Please run through all the different options in each script. Committers and PMC members: --- Need at least 3 +1 votes from this group for the Release to pass. Thanks and Regards. -- Thanks, Chameera -- -- Yexi Jiang, ECS 251, yjian...@cs.fiu.edu School of Computer and Information Science, Florida International University Homepage: http://users.cis.fiu.edu/~yjian004/ Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: MAHOUT 0.9 Release - New URL
Ran the tests, verified sigs, tried out a few of the examples. +1 (binding) On Jan 16, 2014, at 9:41 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Third time's a Charm!!! Here's the new URL for Mahout 0.9 Release: https://repository.apache.org/content/repositories/orgapachemahout-1002/org/apache/mahout/mahout-distribution/0.9/ For those volunteering to test this, some of the things to be verified: a) Verify that u can unpack the release (tar or zip) b) Verify u r able to compile the distro c) Run through the unit tests: mvn clean test d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run through all the different options in each script. Committers and PMC members: --- Need 'at least 3 +1 votes' for the Release to pass. Thanks and Regards.
[jira] [Commented] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
[ https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809813#comment-13809813 ] Grant Ingersoll commented on MAHOUT-1030: - Andrew, I suppose it depends on what part of it you want to address. If it is the literal part of this bug, Pat has been pretty responsive. If it is the reworking of the properties of vectors, that is probably best handled on the mailing list. The basic gist being we want to more intelligently handle vector properties and get rid of NamedVector. [~tdunning], [~robinanil] and others may have some thoughts here as well. (FWIW, I'd prefer the latter to be tackled.) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable Key: MAHOUT-1030 URL: https://issues.apache.org/jira/browse/MAHOUT-1030 Project: Mahout Issue Type: Bug Components: Clustering, Integration Affects Versions: 0.7 Reporter: Jeff Eastman Assignee: Andrew Musselman Fix For: 1.0, 0.9 Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch Looks like this won't make it into this build. Pretty widespread impact on code and tests and I don't know which properties were implemented in the old version. I will create a JIRA and post my interim results. On 6/8/12 12:21 PM, Jeff Eastman wrote: That's a reversion that evidently got in when the new ClusterClassificationDriver was introduced. It should be a pretty easy fix and I will see if I can make the change before Paritosh cuts the release bits tonight. On 6/7/12 1:00 PM, Pat Ferrel wrote: It appears that in kmeans the clusteredPoints are now written as WeightedVectorWritable where in mahout 0.6 they were WeightedPropertyVectorWritable? This means that the distance from the centroid is no longer stored here? Why? I hope I'm wrong because that is not a welcome change. How is one to order clustered docs by distance from cluster centroid?
I'm sure I could calculate the distance but that would mean looking up the centroid for the cluster id given in the above WeightedVectorWritable, which means iterating through all the clusters for each clustered doc. In my case the number of clusters could be fairly large. Am I missing something? -- This message was sent by Atlassian JIRA (v6.1#6144)
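A minimal sketch of the workaround discussed above, assuming the centroids fit in memory and can be keyed by cluster id. All names here are hypothetical and this is not Mahout's API (Python is used only for brevity). With a dictionary, finding the centroid for each clustered document is a constant-time lookup, so there is no need to iterate through all clusters per document:

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def order_by_centroid_distance(centroids, clustered_points):
    """Group documents by cluster and sort each group nearest-first.

    centroids: dict mapping cluster_id -> centroid vector (hypothetical layout)
    clustered_points: iterable of (cluster_id, doc_id, vector) tuples
    Returns: dict mapping cluster_id -> list of (distance, doc_id)
    """
    by_cluster = {}
    for cluster_id, doc_id, vector in clustered_points:
        centroid = centroids[cluster_id]  # O(1) lookup, no scan over clusters
        distance = euclidean(vector, centroid)
        by_cluster.setdefault(cluster_id, []).append((distance, doc_id))
    for docs in by_cluster.values():
        docs.sort()
    return by_cluster
```

This only addresses the lookup cost. It still recomputes a distance that 0.6 stored in the output, which is the regression being reported.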
Re: Mahout's future
On Oct 17, 2013, at 7:46 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: 8. Mahout-627: Baum Welch Algorithm on MapReduce for Parallel HMM Training Grant, do we need to push this to Backlog? Yes. Sorry for the delay, in a new role at work that is consuming most of my cycles at this point in time.
Re: Mahout's future
On Oct 15, 2013, at 1:21 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Will schedule a hangout for this Thursday - 7pm (Eastern Time) tentatively. Sorry, just catching up. I can't make this week, but can next. Feel free to go ahead w/o me at this point, given the momentum. I would like us to first discuss the Mahout 0.9 release, will send out an agenda once I schedule it. Regards, Suneel On Tuesday, October 15, 2013 12:24 AM, Saikat Kanjilal sxk1...@hotmail.com wrote: Following up, Suneel/Grant are we still on for meeting this week on a google hangout, would love to meet this week. From: sxk1...@hotmail.com To: dev@mahout.apache.org Subject: RE: Mahout's future Date: Sun, 6 Oct 2013 07:00:50 -0700 +1 Can you send out a quick agenda (hopefully with my input incorporated) before the hangout? Regards Date: Sun, 6 Oct 2013 03:58:10 -0700 From: suneel_mar...@yahoo.com Subject: Re: Mahout's future To: dev@mahout.apache.org Grant would be available the week of Oct 14 for a hangout (tentatively). We could go ahead and schedule one next week if there's (and seems very much like it) enough response. I can go ahead and facilitate one. I will be 100% focused on Mahout from next week once I start at my new job from Monday. Regarding building something for Deep Learning, Yexi's patch for MLP (see M-1265) may be a good place to refactor/start thinking about the foundations. I guess Ted is alluding to building something like what's been described in the Google paper (see http://www.cs.toronto.edu/~ranzato/publications/DistBeliefNIPS2012_withAppendix.pdf). Correct? Suneel From: Ted Dunning ted.dunn...@gmail.com To: dev@mahout.apache.org dev@mahout.apache.org Cc: dev@mahout.apache.org dev@mahout.apache.org Sent: Sunday, October 6, 2013 2:10 AM Subject: Re: Mahout's future Saikat These are all good suggestions. I would have a hard time suggesting a prioritization of them. Does anybody remember what Grant said about having another hangout? 
Sent from my iPhone On Oct 6, 2013, at 7:15, Saikat Kanjilal sxk1...@hotmail.com wrote: I wanted to mention a few other things:1)It might be useful to take and embed a few already productionalized use cases into the integration tests in mahout, this will help additional users get on board faster2) Deep learning is really interesting, however I'd like to help research some common use cases first before tying this into mahout3) It'd be good to put some thought into documenting when you would choose what type of algorithm given a production machine learning recommendation system to build, this would give more visibility for users into choosing the right mixture of algorithms to build a production ready recommender, often what I've found is that a bulk of the time in building productionalized recommenders is spent cleaning and filtering noisy data4) I'd like to also explore how to tie in machine learning algorithms into real time systems built using twitter storm (http://storm-project.net/), it seems that industry more and more is wanting to do real time analytics on the fly, I'm curious what type of algorithms we'd need for this and back propagate these into mahout It'd be good to meet like minded devs together locally (Seattle) or over gtalk/conference to talk through possibilities. Regards From: ted.dunn...@gmail.com Date: Sat, 5 Oct 2013 18:13:40 -0700 Subject: Re: Mahout's future To: dev@mahout.apache.org On Sat, Oct 5, 2013 at 5:08 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Does it make sense to have a quick meeting of interested developers over google chat/conference rather than email to discuss and assign folks to specifics? Thoughts? Great idea. I think that Grant may have been organizing a hangout. Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: Reconsider moving to Apache CMS only? [Was: Confluence Wiki SPAM and new restrictions in place.]
On Oct 2, 2013, at 7:36 AM, Isabel Drost-Fromm isa...@apache.org wrote: Hi, this topic popped up a couple of times in the past - given the current spam incident in the Apache confluence wikis, a few more restrictions were put into place for editing pages in the wiki: removed any editing access for the confluence-users group. From now on, if someone wants to edit your wiki, you have to whitelist them specifically. You can do this if they are a committer by listing them in the 'Individual Users' section of the Space Permissions area, or by asking that they be added to the special 'asf-cla' wiki group - we will check that they have an iCLA on file before adding them. Given the need to whitelist everyone who wants to make changes to the wiki pages I wonder whether it makes sense to move most of our docs over to Apache CMS (except maybe for the most volatile pages, if there are any). I think it does make sense. The obvious disadvantage would be a higher barrier of entry for people providing docs (though prior to being whitelisted one would have to express the intent to provide improvements on the mailing list anyway). The advantage could be a clearer path towards committership for those not working on code but on technical writing. I like what we do over in Solr land, an official reference guide that is maintained by committers + patches and then a wiki which allows free editing (for the most part). Things generally move from the wiki to the Ref guide. The only question concerning the move to Apache CMS I have: How easy is it to provide documentation for individual released versions? Would it be possible to e.g. bundle the then current docs with the release? It's all in SVN and is usually markdown. Tag it and ship it!
Re: 0.9?
Hi Ted, This sounds good to the extent we can get them done. Do you have JIRA issues for any of these open? November isn't hard and fast for 0.9, but I suspect it will be January if we push things out. -Grant On Sep 28, 2013, at 1:59 PM, Ted Dunning ted.dunn...@gmail.com wrote: The one large-ish feature that I think would find general use would be a high performance classifier trainer. For cleanup sort of thing it would be good to fully integrate the streaming k-means into the normal clustering commands while revamping the command line API. Dmitriy's recent scala work would help quite a bit before 1.0. Not sure it can make 0.9. For recommendations, I think that the demo system that Pat started with the elaborations by Ellen and Tim would be very good to have. I would be happy to collaborate with somebody on these but am not at all likely to have time to actually do them end to end. Sent from my iPhone On Sep 28, 2013, at 12:40, Grant Ingersoll gsing...@apache.org wrote: Moving closer to 1.0, removing cruft, etc. Do we have any more major features planned for 1.0? I think we said during 0.8 that we would try to follow pretty quickly w/ another release. -Grant On Sep 28, 2013, at 12:33 PM, Ted Dunning ted.dunn...@gmail.com wrote: Sounds right in principle but perhaps a bit soon. What would define the release? Sent from my iPhone On Sep 27, 2013, at 7:48, Grant Ingersoll gsing...@apache.org wrote: Anyone interested in thinking about 0.9 in the early Nov. time frame? -Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
Fwd: ASF Board Report - Initial Reminder for Oct 2013
FYI. I'll circulate a draft this week. Begin forwarded message: From: ASF Board bo...@apache.org Subject: ASF Board Report - Initial Reminder for Oct 2013 Date: September 29, 2013 3:29:09 PM EDT To: Grant Ingersoll gsing...@apache.org This email was sent by an automated system on behalf of the ASF Board. It is an initial reminder to give you plenty of time to prepare the report. The meeting is scheduled for Wed, 16 October 2013, 10:30:00:00 PST and the deadline for submitting your report is 1 full week prior to that (Wed, Oct 9th)! According to board records, you are listed as the chair of at least one committee that is due to submit a report this month. [1] [2] Details on which project reports are due and how to submit a report are enclosed below. Please submit your report with sufficient time to allow the board members to review and digest. Again, the very latest you should submit your report is 1 full week (7days) prior to the board meeting (Wed, Oct 9th). If you feel that an error has been made, please consult [1] and if there is still an issue then contact the board directly. As always, PMC chairs are welcome to attend the board meeting. Thanks, The ASF Board [1] - https://svn.apache.org/repos/private/committers/board/committee-info.txt [2] - https://svn.apache.org/repos/private/committers/board/calendar.txt [3] - https://svn.apache.org/repos/private/committers/board/templates Submitting your Report -- Full details about the process and schedule are in [1]. The report should be committed to the meeting agenda in the board directory in the foundation repository, trying to keep a similar format to the others. This can be found at: https://svn.apache.org/repos/private/foundation/board Your report should also be sent in plain-text format to bo...@apache.org with a Subject line that follows the below format: Subject: [REPORT] Project Name Cutting and pasting directly from a Wiki is not acceptable due to formatting issues. 
Line lengths should be limited to 77 characters. Resolutions --- There are several templates for use for various Board resolutions. They can be found in [3] and you are encouraged to use them. It is strongly recommended that if you have a resolution before the board, you are encouraged to attend that board meeting. ASF Board Reports - Reports are due from you for the following committees: - Mahout Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: 0.9?
On Sep 27, 2013, at 9:07 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: I was gonna bring this up myself next week (and was chatting with Isabel about it this morning). I was thinking of the following for 0.9:- 1. We have already removed the algorithms that have been marked as deprecated in 0.8 2. Bugs that have been fixed since 0.8. 3. New Features in 0.9 could include :- a) New Multilayer Perceptron that Yexi had contributed recently and is presently pending review (don't know the JIRA# off the top of my head). b) Using Finite State Transducers as a dictionary type. I had opened a Jira for this and can work on it. Are you using Lucene's FSTs for this? Rest sounds good. Anything else others would like to add??? Grant, could we have a hangout the week of Oct 7 :) ?? I can't that week, but probably the following. From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org dev@mahout.apache.org Sent: Friday, September 27, 2013 8:48 AM Subject: 0.9? Anyone interested in thinking about 0.9 in the early Nov. time frame? -Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: 0.9?
Moving closer to 1.0, removing cruft, etc. Do we have any more major features planned for 1.0? I think we said during 0.8 that we would try to follow pretty quickly w/ another release. -Grant On Sep 28, 2013, at 12:33 PM, Ted Dunning ted.dunn...@gmail.com wrote: Sounds right in principle but perhaps a bit soon. What would define the release? Sent from my iPhone On Sep 27, 2013, at 7:48, Grant Ingersoll gsing...@apache.org wrote: Anyone interested in thinking about 0.9 in the early Nov. time frame? -Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
0.9?
Anyone interested in thinking about 0.9 in the early Nov. time frame? -Grant
word2vec
Anyone looked at: https://code.google.com/p/word2vec/ -Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: Hangout on Monday
On Aug 5, 2013, at 9:30 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Mon, Aug 5, 2013 at 5:21 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Grant had setup a biweekly/weekly Google Doodle for Mahout meetups. Can you say something about how to hijack some of those? Doodle just allows you to capture when people are most available. I believe if you check the link sent earlier, you can see when they are and how people voted. Otherwise, I can dig it up. -Grant
Re: lucene.vectors tool not working
Can you provide more details on what you ran? Also, please ask on u...@mahout.apache.org in the future. Thanks, Grant On Jul 31, 2013, at 9:18 PM, Swami Kevala swami.kev...@ishafoundation.org wrote: I'm using Solr 4.4 and Mahout 0.8 I'm getting the following error SEVERE: There are too many documents that do not have a term vector for text Exception in thread main java.lang.IllegalStateException: There are too many documents that do not have a term vector for text at org.apache.mahout.utils.vectors.lucene.AbstractLuceneIterator.computeNext(AbstractLuceneIterator.java:97) I tried setting the parameter: --maxPercentErrorDocs 0.9 and I still get the same error. I have defined termvectors for my Solr 'text' field Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
[ https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13724029#comment-13724029 ] Grant Ingersoll commented on MAHOUT-627: Dhruv, Any chance this can get done? Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training. - Key: MAHOUT-627 URL: https://issues.apache.org/jira/browse/MAHOUT-627 Project: Mahout Issue Type: Task Components: Classification Affects Versions: 0.4, 0.5 Reporter: Dhruv Kumar Assignee: Grant Ingersoll Labels: gsoc, gsoc2011, mahout-gsoc-11 Fix For: 0.9 Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training. Student Name: Dhruv Kumar Student E-mail: dku...@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the Baum-Welch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework for enhanced model fitting over large data sets. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. 
Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and parts-of-speech tagging. An HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S| length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. 
Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the Baum-Welch training methods, the Baum-Welch algorithm is superior, accurate, and a better candidate for a parallel
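For readers following the proposal, here is a minimal sketch of the Forward algorithm for the Evaluation problem (problem 1), using the (A, B, Pi) notation from the proposal. It is a plain sequential illustration in Python, not Mahout's implementation and not the proposed MapReduce design. It assumes a non-empty observation sequence of integer symbol indices, and it skips the scaling or log-space arithmetic a real implementation would need for long sequences:

```python
def forward_likelihood(A, B, pi, observations):
    """Probability that the HMM (A, B, pi) generated `observations`.

    A[i][j]: probability of moving from hidden state i to hidden state j
    B[s][o]: probability that hidden state s emits symbol o
    pi[s]:   probability of starting in hidden state s
    observations: non-empty list of symbol indices
    """
    n_states = len(pi)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [pi[s] * B[s][observations[0]] for s in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][s] for i in range(n_states)) * B[s][symbol]
            for s in range(n_states)
        ]
    # Marginalize over the final hidden state.
    return sum(alpha)
```

Each step costs O(|S|^2), so a sequence of length T costs O(T * |S|^2). Baum-Welch runs this forward pass together with a matching backward pass for every training sequence in every iteration.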
Re: 0.8
entropy stuff in org.apache.mahout.math.stats.entropy If you are interested in supporting 1 or more of these algorithms, please make it known on dev@mahout.apache.org and via JIRA issues that fix and/or improve them. Please also provide supporting evidence as to their effectiveness for you in production. 1.0 PLANS Our plans as a community are to focus 0.9 on cleanup of bugs and the removal of the code mentioned above and then to follow with a 1.0 release soon thereafter, at which point the community is committing to the support of the algorithms packaged in the 1.0 for at least two minor versions after their release. In the case of removal, we will deprecate the functionality in the 1.(x+1) minor release and remove it in the 1.(x+2) release. For instance, if feature X is to be removed after the 1.2 release, it will be deprecated in 1.3 and removed in 1.4. {quote} [1] http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?revision=1501110&view=markup [2] https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.8%22 From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org dev@mahout.apache.org Sent: Wednesday, July 24, 2013 7:51 AM Subject: 0.8 0.8 artifacts are pushed to the mirror location. I will send an official announcement tomorrow. In the meantime, please review the release notes at: https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8 The new features/fixes section is pretty weak. -Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
Apache Mahout 0.8 Released
The Apache Mahout PMC is pleased to announce the release of Mahout 0.8. Mahout's goal is to build scalable machine learning libraries focused primarily in the areas of collaborative filtering (recommenders), clustering and classification (known collectively as the 3Cs), as well as the necessary infrastructure to support those implementations including, but not limited to, math packages for statistics, linear algebra and others as well as Java primitive collections, local and distributed vector and matrix classes and a variety of integrative code to work with popular packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache Cassandra and much more. The 0.8 release is mainly a clean up release in preparation for an upcoming 1.0 release, but there are several significant new features, which are highlighted below. To get started with Apache Mahout 0.8, download the release artifacts and signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the central Maven repository. In addition to the release highlights and artifacts, please pay attention to the section labelled FUTURE PLANS below for more information about upcoming releases of Mahout. As with any release, we wish to thank all of the users and contributors to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for individual credits, as there are too many to list here. GETTING STARTED In the release package, the examples directory contains several working examples of the core functionality available in Mahout. These can be run via scripts in the examples/bin directory and will prompt you for more information to help you try things out. Most examples do not need a Hadoop cluster in order to run. RELEASE HIGHLIGHTS The highlights of the Apache Mahout 0.8 release include, but are not limited to the list below. For further information, see the included CHANGELOG file. 
- Numerous performance improvements to Vector and Matrix implementations, API's and their iterators (see also MAHOUT-1192, MAHOUT-1202)
- Numerous performance improvements to the recommender implementations (see also MAHOUT-1272, MAHOUT-1035, MAHOUT-1042, MAHOUT-1151, MAHOUT-1166, MAHOUT-1167, MAHOUT-1169, MAHOUT-1205, MAHOUT-1264)
- MAHOUT-1088: Support for biased item-based recommender
- MAHOUT-1089: SGD matrix factorization for rating prediction with user and item biases
- MAHOUT-1106: Support for SVD++
- MAHOUT-944: Support for converting one or more Lucene storage indexes to SequenceFiles as well as an upgrade of the supported Lucene version to Lucene 4.3.1.
- MAHOUT-1154 and friends: New streaming k-means implementation that offers on-line (and fast) clustering
- MAHOUT-833: Make conversion to SequenceFiles Map-Reduce, 'seqdirectory' can now be run as a MapReduce job.
- MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values).
- MAHOUT-884: Matrix Concat utility, presently only concatenates two matrices.
- MAHOUT-1244: Upgraded to use Lucene 4.3
- MAHOUT-1187: Upgraded to CommonsLang3
- MAHOUT-916: Speedup the Mahout build by making tests run in parallel.
- The usual bug fixes. See JIRA [2] for more information on the 0.8 release.

A total of 218 separate JIRA issues are addressed in this release.

CONTRIBUTING

Mahout is always looking for contributions focused on the 3Cs. If you are interested in contributing, please see our contribution page, https://cwiki.apache.org/MAHOUT/how-to-contribute.html, on the Mahout wiki or contact us via email at dev@mahout.apache.org.

FUTURE PLANS

0.9

As the project moves towards a 1.0 release, the community is working to clean up and/or remove parts of the code base that are under-supported or that underperform as well as to better focus the energy and contributions on key algorithms that are proven to scale in production and have seen wide-spread adoption.
To this end, in the next release, the project is planning on removing support for the following algorithms unless there is sustained support and improvement of them before the next release. The algorithms to be removed are:

- From Clustering:
    Dirichlet
    MeanShift
    MinHash
    Eigencuts
- From Classification (both are sequential implementations):
    Winnow
    Perceptron
- Frequent Pattern Mining
- Collaborative Filtering:
    All recommenders in org.apache.mahout.cf.taste.impl.recommender.knn
    SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone
    Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
    TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender
- Mahout Math:
    Lanczos in favour of SSVD
    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

If you are interested in supporting 1 or more of these algorithms, please make it known on dev@mahout.apache.org and via JIRA issues that fix and/or improve them. Please also provide supporting evidence as to their
Re: 0.8
On Jul 25, 2013, at 11:08 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: What does it mean -- remove Mahout Math? It is a high level bullet, see the items underneath. Unfortunately, they don't translate to text format very well.
[jira] [Updated] (MAHOUT-1284) DummyRecordWriter's bug with reused Writables
[ https://issues.apache.org/jira/browse/MAHOUT-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-1284: Fix Version/s: (was: 0.8) (was: 0.7) 0.9 DummyRecordWriter's bug with reused Writables - Key: MAHOUT-1284 URL: https://issues.apache.org/jira/browse/MAHOUT-1284 Project: Mahout Issue Type: Bug Affects Versions: 0.7, 0.8 Reporter: Maysam Yabandeh Priority: Minor Labels: test Fix For: 0.9 Attachments: MAHOUT-1284.patch Original Estimate: 1h Remaining Estimate: 1h It is a recommended practice to reuse the Writable objects. DummyRecordWriter, which is used for testing in Mahout, however keeps the same Writable instance in a map: next time that the user reuses the Writable object, the internal map of DummyRecordWriter changes as well. This makes DummyRecordWriter fail for testing the MapReduce jobs that reuse the Writables. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
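The aliasing problem described in this issue can be reproduced outside Hadoop. The sketch below is a language-neutral illustration in Python with hypothetical class names. It is not the Mahout `DummyRecordWriter` code and not the attached patch. It shows why storing a reference to a reused mutable object corrupts earlier records, and how copying on write avoids that:

```python
import copy


class Text:
    """Stand-in for a mutable, reusable Writable (hypothetical)."""

    def __init__(self, value=""):
        self.value = value

    def set(self, value):
        self.value = value


class AliasingRecorder:
    """Keeps the caller's object itself, so later reuse rewrites history."""

    def __init__(self):
        self.records = []

    def write(self, value):
        self.records.append(value)


class CopyingRecorder(AliasingRecorder):
    """Stores a private copy, so the caller may safely reuse its object."""

    def write(self, value):
        self.records.append(copy.copy(value))


def emit_all(recorder, words):
    """Mimics a mapper that reuses one output object for every record."""
    reused = Text()
    for word in words:
        reused.set(word)
        recorder.write(reused)
```

With `emit_all(recorder, ["a", "b", "c"])`, the aliasing recorder ends up holding three references to the same object, all reading "c", while the copying recorder keeps "a", "b", "c".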
0.8
0.8 artifacts are pushed to the mirror location. I will send an official announcement tomorrow. In the meantime, please review the release notes at: https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8 The new features/fixes section is pretty weak. -Grant
Re: [VOTE] Release Mahout 0.8
This passes. I will finish off the release either tonight or tomorrow AM. On Jul 19, 2013, at 3:06 AM, Jake Mannix jake.man...@gmail.com wrote: +1 from me, I used the jars to run some LDA (on a couple hundred million documents) on the work cluster (1.0.something small), and it worked fine. Other clustering example (with reuters) also worked as expected. On Thu, Jul 18, 2013 at 11:27 AM, Suneel Marthi suneel_mar...@yahoo.comwrote: +1 from me. From: Sebastian Schelter s...@apache.org To: dev@mahout.apache.org Sent: Thursday, July 18, 2013 1:22 PM Subject: Re: [VOTE] Release Mahout 0.8 +1 from me, recommender stuff worked fine in my tests 2013/7/18 Grant Ingersoll gsing...@apache.org +1 from me. On Jul 16, 2013, at 4:52 PM, Grant Ingersoll gsing...@apache.org wrote: Applying a forcing function: Please vote on releasing the 0.8 artifacts at https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/ . Release notes are at https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8 [] +1 Looks good [] 0 - No opinion [] -1 Don't release Vote criteria from https://www.apache.org/dev/release.html What are the ASF requirements on approving a release? Votes on whether a package is ready to be released use majority approval -- i.e., at least three PMC members must vote affirmatively for release, and there must be more positive than negative votes. Releases may not be vetoed. Before voting +1 PMC members are required to download the signed source code package, compile it as provided, and test the resulting executable on their own platform, along with also verifying that the package meets the requirements of the ASF policy on releases. Thanks, Grant -- -jake Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: [VOTE] Release Mahout 0.8
+1 from me. On Jul 16, 2013, at 4:52 PM, Grant Ingersoll gsing...@apache.org wrote: Applying a forcing function: Please vote on releasing the 0.8 artifacts at https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/. Release notes are at https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8 [] +1 Looks good [] 0 - No opinion [] -1 Don't release Vote criteria from https://www.apache.org/dev/release.html What are the ASF requirements on approving a release? Votes on whether a package is ready to be released use majority approval -- i.e., at least three PMC members must vote affirmatively for release, and there must be more positive than negative votes. Releases may not be vetoed. Before voting +1 PMC members are required to download the signed source code package, compile it as provided, and test the resulting executable on their own platform, along with also verifying that the package meets the requirements of the ASF policy on releases. Thanks, Grant
Re: mahout-distribution-0.8-src.tar.gz cannot be unpacked on Linux
If the artifacts don't work, this is a blocker. On Jul 18, 2013, at 2:27 AM, Stevo Slavić ssla...@gmail.com wrote: Hello team, Just like binary distribution couldn't be unpacked (see MAHOUT-1229 https://issues.apache.org/jira/browse/MAHOUT-1229), I've just discovered that mahout-distribution-0.8-src.tar.gz also cannot be unpacked (mahout executable cannot be unpacked to bin directory, bin directory permissions are not set). Zip distribution src archive can be unpacked. Fix is trivial, equivalent to the fix for MAHOUT-1229. Shall we just fix this in 0.9 or release new 0.8 RC with this fixed? Kind regards, Stevo Slavic.
Re: mahout-distribution-0.8-src.tar.gz cannot be unpacked on Linux
On Jul 18, 2013, at 1:23 PM, Grant Ingersoll gsing...@apache.org wrote: If the artifacts don't work, this is a blocker. On 2nd thought, we could just doc that piece. On Jul 18, 2013, at 2:27 AM, Stevo Slavić ssla...@gmail.com wrote: Hello team, Just like binary distribution couldn't be unpacked (see MAHOUT-1229 https://issues.apache.org/jira/browse/MAHOUT-1229), I've just discovered that mahout-distribution-0.8-src.tar.gz also cannot be unpacked (mahout executable cannot be unpacked to bin directory, bin directory permissions are not set). Zip distribution src archive can be unpacked. Fix is trivial, equivalent to the fix for MAHOUT-1229. Shall we just fix this in 0.9 or release new 0.8 RC with this fixed? Kind regards, Stevo Slavic. Grant Ingersoll | @gsingers http://www.lucidworks.com
[VOTE] Release Mahout 0.8
Applying a forcing function: Please vote on releasing the 0.8 artifacts at https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/. Release notes are at https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8 [] +1 Looks good [] 0 - No opinion [] -1 Don't release Vote criteria from https://www.apache.org/dev/release.html What are the ASF requirements on approving a release? Votes on whether a package is ready to be released use majority approval -- i.e., at least three PMC members must vote affirmatively for release, and there must be more positive than negative votes. Releases may not be vetoed. Before voting +1 PMC members are required to download the signed source code package, compile it as provided, and test the resulting executable on their own platform, along with also verifying that the package meets the requirements of the ASF policy on releases. Thanks, Grant
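As a quick illustration of the quoted vote criteria, here is a small sketch of the tally rule. The function name and the +1/0/-1 integer encoding are assumptions made for this example. Only binding (PMC) votes should be passed in, and the rule covers only the majority-approval arithmetic, not the download/compile/test duties of each voter:

```python
def release_vote_passes(binding_votes):
    """Majority approval as quoted: at least three +1 votes and more
    +1 votes than -1 votes. Votes are encoded as +1, 0 or -1."""
    plus = sum(1 for vote in binding_votes if vote == 1)
    minus = sum(1 for vote in binding_votes if vote == -1)
    return plus >= 3 and plus > minus
```

Because releases may not be vetoed, a single -1 does not block the release under this rule. It only counts against the +1 total.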
Re: Mahout release process
On Jul 14, 2013, at 7:27 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I'd say go for it. Of course, my preference would be that time spent on Mahout right now is focused on testing 0.8, but you are free to do as you wish. it looks good on my part. I found however that a bug was (re-?) introduced into the UpperTriangular matrix (it breaks the row count property in a certain form of the constructor), which however did not seem to affect any of the existing solvers. this is fixed as a part of M-1281 Do we need to respin?
Re: Mahout release process
On Jul 11, 2013, at 12:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Grant, so we have released then and can commit 0.9 issues to trunk now? or we are still frozen and waiting for final release steps? or release candidates? I think you can, but the big unknown to me is how Maven handles rollbacks if something goes wrong. I guess I can always pull the tag/branch and work off of that. because it is my understanding that after we have build 0.9 artifacts, we cannot build them again -- so we must have built final 0.9 then. If for some reason we are not happy with 0.9 artifacts we kind of have to build something like 0.9.1 but not 0.9 again... anyway i just want to know when it is ok to start pushing 0.9 things to master. I'd say go for it. Of course, my preference would be that time spent on Mahout right now is focused on testing 0.8, but you are free to do as you wish. Thank you, sir. -d On Thu, Jul 11, 2013 at 7:31 AM, Grant Ingersoll gsing...@apache.orgwrote: On Jul 10, 2013, at 5:05 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: i thought maven release:prepare changes from 0.8-SNAPSHOT to 0.8 (and eliminates snapshot dependencies). and release:perform goes from 0.8 to 0.9-SNAPSHOT. I.e. it guarantees that by the time you have 0.9-SNAPSHOT set, you also have a released 0.8 build. Correct. The release artifacts are 0.8, no SNAPSHOT, trunk is 0.9-SNAPSHOT but for some reason it is not what is happening now on trunk. On Wed, Jul 10, 2013 at 10:06 AM, Jake Mannix jake.man...@gmail.com wrote: On Wed, Jul 10, 2013 at 10:00 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: That's how the maven release plugin does it in my experience, and yes that's what I get now too. Ok, that's fine if it's intended, but it seems to put us in a little bit of a weird state. We tell our users often to build on trunk, so if they're using the current most recent release (0.7), then if they do that now, they go from 0.7 to 0.9-SNAPSHOT. 
Not the end of the world, but this would be avoided if we were on a release branch, right? Maybe next time, we can do that? On Wed, Jul 10, 2013 at 10:54 AM, Jake Mannix jake.man...@gmail.com wrote: So quick question: is an intentional side-effect of the current release process that when we build on trunk now, we build artifacts named e.g. mahout-examples-0.9-SNAPSHOT-job.jar ? On Wed, Jul 10, 2013 at 2:33 AM, Sean Owen sro...@gmail.com wrote: Yes you can do all of this in a branch, which would let things continue to change on HEAD. Otherwise HEAD has to be frozen. I think here there's not enough velocity of change to make freezing HEAD that big of a deal, but yes you could manage the process yourself in a branch if you wanted to. Tags are changeable in SVN. Nobody is depending on the tag until after the release is finalized, so moving them during the release or reapplying them is no big thing. The release process doesn't update Maven artifacts, even snapshots, so the process does not affect what artifacts end users use. RCs are indeed all labeled x.y but are certainly distinguished by date, timestamp. It's not a RC in the sense that it may evolve and change in response to bug fixes over weeks or months -- it's either a valid build or it isn't right now, and is released or not in a few days unless there is a critical build problem. It will only be developers that might ever distinguish several builds. You can use x.y.z for sure and I personally would be happy to see 0.8.0 used instead of 0.8. That is technically more standard Maven convention. I don't think there will be enough change / energy for point releases but it doesn't hurt to allow for the possibility. On Wed, Jul 10, 2013 at 10:11 AM, Stevo Slavić ssla...@gmail.com wrote: This is continuation of my and Grant's discussion on https://issues.apache.org/jira/browse/MAHOUT-1275 which I believe is better suited to be continued here on the dev mailing list. 
Apologies for my ignorance, if this discussion took place earlier in the project lifetime. There is no 0.8 branch here: http://svn.apache.org/viewvc/mahout/branches/ maven-release-plugin:prepare creates a tag only, which (in svn) although similar to branch, shouldn't be modifiable, and we need it to be modifiable if something needs to be changed for final 0.8 release, without stopping/freezing 0.9 development. Release instructions basically state that if something is wrong with RC release, to delete RC release (drop staging repo and delete tag from svn), rollback version changes on trunk (from 0.9-SNAPSHOT back to 0.8-SNAPSHOT), make a fix on trunk, and prepare/perform RC release again (same 0.8 release version). Current 0.8 RC, IMO is not a proper RC - if we need to make a change to it and release another RC, there would be no obvious distinction between the two RCs, especially to Maven builds
Re: Mahout release process
I made a branch off of 0.8, so presumably any fixes can be made off of that and then we can retag as necessary. On Jul 14, 2013, at 7:29 AM, Grant Ingersoll gsing...@apache.org wrote: On Jul 11, 2013, at 12:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Grant, so we have released then and can commit 0.9 issues to trunk now? or we are still frozen and waiting for final release steps? or release candidates? I think you can, but the big unknown to me is how Maven handles rollbacks if something goes wrong. I guess I can always pull the tag/branch and work off of that. because it is my understanding that after we have build 0.9 artifacts, we cannot build them again -- so we must have built final 0.9 then. If for some reason we are not happy with 0.9 artifacts we kind of have to build something like 0.9.1 but not 0.9 again... anyway i just want to know when it is ok to start pushing 0.9 things to master. I'd say go for it. Of course, my preference would be that time spent on Mahout right now is focused on testing 0.8, but you are free to do as you wish. Thank you, sir. -d On Thu, Jul 11, 2013 at 7:31 AM, Grant Ingersoll gsing...@apache.orgwrote: On Jul 10, 2013, at 5:05 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: i thought maven release:prepare changes from 0.8-SNAPSHOT to 0.8 (and eliminates snapshot dependencies). and release:perform goes from 0.8 to 0.9-SNAPSHOT. I.e. it guarantees that by the time you have 0.9-SNAPSHOT set, you also have a released 0.8 build. Correct. The release artifacts are 0.8, no SNAPSHOT, trunk is 0.9-SNAPSHOT but for some reason it is not what is happening now on trunk. On Wed, Jul 10, 2013 at 10:06 AM, Jake Mannix jake.man...@gmail.com wrote: On Wed, Jul 10, 2013 at 10:00 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: That's how the maven release plugin does it in my experience, and yes that's what I get now too. Ok, that's fine if it's intended, but it seems to put us in a little bit of a weird state. 
We tell our users often to build on trunk, so if they're using the current most recent release (0.7), then if they do that now, they go from 0.7 to 0.9-SNAPSHOT. Not the end of the world, but this would be avoided if we were on a release branch, right? Maybe next time, we can do that? On Wed, Jul 10, 2013 at 10:54 AM, Jake Mannix jake.man...@gmail.com wrote: So quick question: is an intentional side-effect of the current release process that when we build on trunk now, we build artifacts named e.g. mahout-examples-0.9-SNAPSHOT-job.jar ? On Wed, Jul 10, 2013 at 2:33 AM, Sean Owen sro...@gmail.com wrote: Yes you can do all of this in a branch, which would let things continue to change on HEAD. Otherwise HEAD has to be frozen. I think here there's not enough velocity of change to make freezing HEAD that big of a deal, but yes you could manage the process yourself in a branch if you wanted to. Tags are changeable in SVN. Nobody is depending on the tag until after the release is finalized, so moving them during the release or reapplying them is no big thing. The release process doesn't update Maven artifacts, even snapshots, so the process does not affect what artifacts end users use. RCs are indeed all labeled x.y but are certainly distinguished by date, timestamp. It's not a RC in the sense that it may evolve and change in response to bug fixes over weeks or months -- it's either a valid build or it isn't right now, and is released or not in a few days unless there is a critical build problem. It will only be developers that might ever distinguish several builds. You can use x.y.z for sure and I personally would be happy to see 0.8.0 used instead of 0.8. That is technically more standard Maven convention. I don't think there will be enough change / energy for point releases but it doesn't hurt to allow for the possibility. 
On Wed, Jul 10, 2013 at 10:11 AM, Stevo Slavić ssla...@gmail.com wrote: This is continuation of my and Grant's discussion on https://issues.apache.org/jira/browse/MAHOUT-1275 which I believe is better suited to be continued here on the dev mailing list. Apologies for my ignorance, if this discussion took place earlier in the project lifetime. There is no 0.8 branch here: http://svn.apache.org/viewvc/mahout/branches/ maven-release-plugin:prepare creates a tag only, which (in svn) although similar to branch, shouldn't be modifiable, and we need it to be modifiable if something needs to be changed for final 0.8 release, without stopping/freezing 0.9 development. Release instructions basically state that if something is wrong with RC release, to delete RC release (drop staging repo and delete tag from svn), rollback version changes on trunk (from 0.9-SNAPSHOT back to 0.8-SNAPSHOT), make a fix on trunk, and prepare/perform RC release again (same 0.8 release
Re: Mahout release process
.x after final release could have 0.8.1-SNAPSHOT version, for any critical support changes in future, before 0.9 release. During whole time of forging 0.8 RC and final releases on their own 0.8.x branch, 0.9-SNAPSHOT development on trunk can go on. Also, there would be no rollbacks necessary for RC releases (with exception of cases when it's really necessary, e.g. when release of some RC is incomplete/breaks because of network failure or something similar). Also tags stay non-modifiable. I noticed at least one Apache project to follow this release workflow (with staging RCs with different Maven coordinates, and promoting an RC to final release), namely on Apache HttpComponents project. I could understand current release process, if idea is to have all hands focused on the release while it's being made/tested, and also making it obvious to community (with absence of branches other than trunk) that there is no support whatsoever possible/available via minor releases, apart from changes on trunk and next major release. Kind regards, Stevo Slavić. -- -jake -- -jake Grant Ingersoll | @gsingers http://www.lucidworks.com
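The version bookkeeping discussed in this thread (0.8-SNAPSHOT becomes 0.8 for the release, trunk moves on to 0.9-SNAPSHOT, and a 0.8.x branch would carry 0.8.1-SNAPSHOT) can be sketched as plain string handling. This is not the maven-release-plugin API, just an illustration; the class and method names are hypothetical, and the maintenance rule assumes a two-component release version such as 0.8.

```java
/** Illustrative version arithmetic only; not part of Maven or Mahout. */
final class VersionFlow {
    private static final String SNAPSHOT = "-SNAPSHOT";

    private VersionFlow() {}

    /** Strips the snapshot qualifier, e.g. 0.8-SNAPSHOT becomes 0.8. */
    static String releaseVersion(String devVersion) {
        if (!devVersion.endsWith(SNAPSHOT)) {
            throw new IllegalArgumentException("not a snapshot version: " + devVersion);
        }
        return devVersion.substring(0, devVersion.length() - SNAPSHOT.length());
    }

    /** Bumps the last numeric component for trunk, e.g. 0.8 becomes 0.9-SNAPSHOT. */
    static String nextTrunkVersion(String release) {
        String[] parts = release.split("\\.");
        int last = Integer.parseInt(parts[parts.length - 1]) + 1;
        parts[parts.length - 1] = Integer.toString(last);
        return String.join(".", parts) + SNAPSHOT;
    }

    /** First snapshot on a maintenance branch, e.g. 0.8 becomes 0.8.1-SNAPSHOT. */
    static String maintenanceVersion(String release) {
        return release + ".1" + SNAPSHOT;
    }

    public static void main(String[] args) {
        String release = releaseVersion("0.8-SNAPSHOT");
        System.out.println(release);                     // 0.8
        System.out.println(nextTrunkVersion(release));   // 0.9-SNAPSHOT
        System.out.println(maintenanceVersion(release)); // 0.8.1-SNAPSHOT
    }
}
```

With a release branch, trunk and the branch each keep their own snapshot version, so a problem with an RC would be fixed on the branch without rolling trunk back from 0.9-SNAPSHOT.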
Re: EigenDecomposition
FWIW, the only way we are getting out of code freeze is if we actually get some feedback on the RC. It passes my tests, but I haven't heard from others much. -Grant On Jul 10, 2013, at 5:13 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: meant, after code freeze is over. On Wed, Jul 10, 2013 at 2:13 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: fixed as part of MAHOUT-1281 patch now. I will push after code freeze. On Wed, Jul 10, 2013 at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: Please file. Looks completely innocuous and it is good to be standard. On Wed, Jul 10, 2013 at 12:59 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Looks like Lanczos is having the same problem and needs to undo some workarounds:
EigenDecomposition decomp = new EigenDecomposition(triDiag);
Matrix eigenVects = decomp.getV();
Vector eigenVals = decomp.getRealEigenvalues();
endTime(TimingSection.TRIDIAG_DECOMP);
startTime(TimingSection.FINAL_EIGEN_CREATE);
for (int row = 0; row < i; row++) {
  Vector realEigen = null;
  // the eigenvectors live as columns of V, in reverse order. Weird but true.
  Vector ejCol = eigenVects.viewColumn(i - row - 1);
  int size = Math.min(ejCol.size(), state.getBasisSize());
On Wed, Jul 10, 2013 at 12:53 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: changing line 329 of EigenDecomposition.java from if (d.getQuick(j) p) { to if (d.getQuick(j) p) { makes my MAHOUT-1281 patch work. Should I keep the change? (question for Ted, I guess) thanks. -D On Wed, Jul 10, 2013 at 11:59 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: It looks like values out of our ported EigenDecomposition are coming out sorted in inverse order. Shouldn't it be the other way around? Grant Ingersoll | @gsingers http://www.lucidworks.com
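The Lanczos snippet quoted in this thread reads column `i - row - 1` to undo the reversed ordering. The same idea on plain arrays, as a rough sketch: it deliberately avoids the Mahout Matrix/Vector types, the class and method names are invented for illustration, and it only shows that eigenvalues and eigenvector columns have to be reversed together to stay paired.

```java
/** Plain-array illustration of reversing eigenvalue/eigenvector order; not Mahout code. */
final class EigenOrder {
    private EigenOrder() {}

    /** Returns a copy of values in reversed order. */
    static double[] reversed(double[] values) {
        double[] out = new double[values.length];
        for (int j = 0; j < values.length; j++) {
            out[j] = values[values.length - 1 - j];
        }
        return out;
    }

    /** Returns a copy of v whose column j is column (cols - 1 - j) of v. */
    static double[][] reverseColumns(double[][] v) {
        double[][] out = new double[v.length][];
        for (int r = 0; r < v.length; r++) {
            int cols = v[r].length;
            out[r] = new double[cols];
            for (int c = 0; c < cols; c++) {
                out[r][c] = v[r][cols - 1 - c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] eigenVals = {1.0, 2.0, 3.0};
        double[][] eigenVects = {
            {1.0, 0.0, 0.0},
            {0.0, 1.0, 0.0},
            {0.0, 0.0, 1.0}
        };
        // Reverse both so that value j still belongs to column j.
        double[] vals = reversed(eigenVals);
        double[][] vects = reverseColumns(eigenVects);
        System.out.println(java.util.Arrays.toString(vals));
        System.out.println(java.util.Arrays.deepToString(vects));
    }
}
```

If the decomposition itself is changed to emit the other order, any caller carrying a workaround like the one above would have to drop it at the same time, otherwise the order gets flipped twice.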
[jira] [Commented] (MAHOUT-1275) Drop some of the Release Artifact File Types
[ https://issues.apache.org/jira/browse/MAHOUT-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13703155#comment-13703155 ] Grant Ingersoll commented on MAHOUT-1275: - [~sslavic] Yeah, Maven release does create the branch and that is the workflow I usually use as well. The main issue I have, is it seems like the Maven release goal has to rollback things if for some reason there are issues w/ the RC, but perhaps that is just our misunderstanding of how to use the Maven release goal. Please have a look at https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Release to see if our understanding of that is right. Drop some of the Release Artifact File Types Key: MAHOUT-1275 URL: https://issues.apache.org/jira/browse/MAHOUT-1275 Project: Mahout Issue Type: Task Reporter: Grant Ingersoll Assignee: Stevo Slavic Priority: Minor Fix For: 0.9 There really is no reason why we need so many release artifacts for the distribution. We run on *NIX machines. Zip and Gzip are standard tools, let's save a few bits, along with Release Manager upload times, and drop the BZ2 format. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: AWS test bed
That sounds cool. I think one of the keys is making it easy to spin up and test our stuff there. On Jul 9, 2013, at 1:36 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Just got a promo code from the AWS team that will buy $1,000 of their services. On Tue, Jul 9, 2013 at 10:52 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: One of the things we chatted about last night in the hangout was how to automate this regression process. I reached out to our friends at Amazon Web Services, who are looking at how they could donate compute time so we could use a cluster as well as regressing on our own hosts. We could either spin things up and run things manually or write some scripts to do it; in any case I will keep you posted on what develops. Best Andrew Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: (Bi-)Weekly/Monthly Dev Sessions
No worries, next time! It was a good first attempt, mainly focused around testing 0.8 and getting up to speed on what needs to be tested. The next time I am available is August 5th. If others want to meet before then, please do, otherwise, I will send out a reminder closer to the 5th. On Jul 9, 2013, at 12:39 PM, Peng Cheng pc...@uowmail.edu.au wrote: Sorry I missed the meeting, I really want to listen to your discussion but yesterday a thunderstorm cut off my electricity. On 13-07-08 08:29 PM, Andrew Musselman wrote: I'm getting an error when I build after doing svn up:
$ mvn package
[INFO] Scanning for projects...
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]
[ERROR] The project (/home/akm/mahout/pom.xml) has 1 error
[ERROR] Non-readable POM /home/akm/mahout/pom.xml: no more data available - expected end tag </project> to close start tag <project> from line 2, parser stopped on END_TAG seen ...</reporting>\n</project>\n... @1030:1
But there's a </project> tag at the end of that.. On Mon, Jul 8, 2013 at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote: Hmm, seems like that old link doesn't work. Here's a new one: https://plus.google.com/hangouts/_/899b63ca1b3864c749886348cdddfcd80d00bb0b?hl=en -Grant On Jul 7, 2013, at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote: How about tomorrow (Monday) night at 8:30 pm EDT? Anyone who wants to join, can browse to https://plus.google.com/hangouts/_/1aa32da8d1f9b1669cf6b5ec8bce123d12aec409?hl=en If for some reason that doesn't work, ping me on IRC (gsingers) in the #mahout channel on Freenode. Agenda: 0.8 Release Testing -Grant On Jun 25, 2013, at 6:17 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Is today's Hangout happening? On Wed, Jun 12, 2013 at 4:26 AM, Grant Ingersoll gsing...@apache.org wrote: Hi, One of the things we kicked around at Buzzwords was having a weekly/bi-weekly/monthly dev session via Google hangout (Drill does this with good success, I believe). 
Since we are so spread out, I thought I would throw out a Doodle (scheduling tool for those unfamiliar) to see what times work best for the majority of people interested in such a thing. Anyone is free to participate, but this is not a Q and A session, but is instead focused on writing code, fixing bugs, triaging JIRA, releasing, etc. If you are interested, please fill out http://doodle.com/gatxxkm7f25fq5y8 (note, all times are Eastern Time Zone since I did the poll!) I just grabbed a sampling of hours throughout the day. I also picked 1 week as being representative of this being on a repeating schedule. If none of the times work for you, but you are still interested, please respond here. I would imagine we would meet for 1-2 hours. Also, please reply with the frequency at which you would like to meet: [] Weekly [] Bi-weekly (every 2 weeks) [] Monthly My vote is every two weeks. -Grant -- Thanks, Pradeep Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: 0.8 progress
Any feedback yet on the RCs? -Grant On Jul 8, 2013, at 1:51 PM, Peng Cheng pc...@uowmail.edu.au wrote: Hi Sebastian, I'm sorry for the entirely noobish questions: where can I download the judging.txt ground truth set? (Netflix is pulling it off everywhere, so far I can only get the legacy trainingSet and qualifying.txt) and how do I inject the ParallelAlsFactorizationJob into a common recommender class? I was trying to reproduce your result (I own a small cluster), but don't even know where to start. The only related thing I found in mahout-example is a format converter. Thanks a lot if you can give me a hint. - Yours Peng On 13-07-01 01:24 AM, Sebastian Schelter wrote: I successfully ran the ALS and cooccurrence-based recommenders on the Netflix dataset on a 26 machine cluster using Hadoop 1.0.4. --sebastian On 28.06.2013 21:31, Jake Mannix wrote: I can run LDA on Twitter's cluster, on both reuters and some real data, as well as LR/SGD. On Fri, Jun 28, 2013 at 11:51 AM, Grant Ingersoll gsing...@apache.org wrote: We really should set up a VM on which we can run a couple of nodes (perhaps at ASF?) and that we can share w/ everyone, making it easy to test our stuff on Hadoop for the specific version that we ship. On Jun 28, 2013, at 2:41 PM, Robin Anil robin.a...@gmail.com wrote: Can someone (if you have time and experience) write a small shim to run all examples one after the other on a cluster and write up instructions on how to do it? Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Fri, Jun 28, 2013 at 1:11 PM, Sebastian Schelter s...@apache.org wrote: It's crucial that we retest everything on a real cluster before the release. I will do this for the recommenders code next week. --sebastian On 28.06.2013 at 14:03, Grant Ingersoll gsing...@apache.org wrote: I should have time next week to do the release, if we can get these knocked out. If not next week, the following. On Jun 28, 2013, at 5:46 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: 1. 
Could someone look at Mahout-1257? There is a patch that's been submitted but I am not sure if this has been superseded by Sean's against Mahout-1239. 2. Stevo, I am for fixing the findbugs excludes as part of 0.8 release, I see that the number of warnings has gone up over the last few builds. 3. I am more concerned about the cause of the mysterious cosmic rays that randomly fail unit tests (since we have moved to running parallel tests). I see that happening on my local repository too. From: Stevo Slavić ssla...@gmail.com To: dev@mahout.apache.org Sent: Friday, June 28, 2013 3:21 AM Subject: Re: 0.8 progress Well done team! Build is unstable, oscillates, IMO regardless of changes made. Judging from logs I suspect that some of the Jenkins nodes are not configured well, /tmp directory security related issues, and file size constraints. Could be also issue with our tests. Javadoc was reported earlier not to be OK (not all modules in aggregated javadoc), and code quality reports are not working OK, e.g. findbugs doesn't respect excludes - plan to work on this during weekend. Do we want to fix these before or after 0.8 release? Kind regards, Stevo Slavić. On Fri, Jun 28, 2013 at 12:32 AM, Robin Anil robin.a...@gmail.com wrote: All Done Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Sun, Jun 23, 2013 at 11:36 PM, Robin Anil robin.a...@gmail.com wrote: I sent the comments. The code is good. But without the matrix/vector input we cant ship it in the release. Hope Yiqun and Da Zhang can make those changes quickly. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Sun, Jun 23, 2013 at 8:46 PM, Grant Ingersoll gsing...@apache.org wrote: I see 1 issue left: MAHOUT-1214. It is assigned to Robin. Any chance we can finish this up this week? -Grant On Jun 23, 2013, at 9:26 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Finally got to finishing up M-833, the changes can be reviewed at https://reviews.apache.org/r/11774/diff/3/. 
From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org Sent: Tuesday, June 11, 2013 10:09 AM Subject: Re: 0.8 progress I pushed M-1030 and M-1233. If we can get M-833 and M-1214 in by Thursday, I can roll an RC on Thursday. -Grant On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote: Down to 4 issues! I would say what they are, but JIRA is flaking out again. My instinct is that 1030 and 1233 can be pushed. Suneel has been working hard to get M-833 in. Not sure on M-1214, Robin? -G On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote: On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote: M-1067 -- Dmitriy -- This is an enhancement, should we push
[jira] [Created] (MAHOUT-1275) Drop some of the Release Artifact File Types
Grant Ingersoll created MAHOUT-1275: --- Summary: Drop some of the Release Artifact File Types Key: MAHOUT-1275 URL: https://issues.apache.org/jira/browse/MAHOUT-1275 Project: Mahout Issue Type: Task Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 0.9 There really is no reason why we need so many release artifacts for the distribution. We run on *NIX machines. Zip and Gzip are standard tools, let's save a few bits, along with Release Manager upload times, and drop the BZ2 format.
[jira] [Commented] (MAHOUT-1275) Drop some of the Release Artifact File Types
[ https://issues.apache.org/jira/browse/MAHOUT-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702370#comment-13702370 ] Grant Ingersoll commented on MAHOUT-1275: - Stevo, just FYI, please don't commit anything right now, as we are under code freeze until 0.8 is out (unless you know how to deal w/ this in Maven release plugin) Drop some of the Release Artifact File Types Key: MAHOUT-1275 URL: https://issues.apache.org/jira/browse/MAHOUT-1275 Project: Mahout Issue Type: Task Reporter: Grant Ingersoll Assignee: Stevo Slavic Priority: Minor Fix For: 0.9 There really is no reason why we need so many release artifacts for the distribution. We run on *NIX machines. Zip and Gzip are standard tools, let's save a few bits, along with Release Manager upload times, and drop the BZ2 format.
[jira] [Commented] (MAHOUT-1275) Drop some of the Release Artifact File Types
[ https://issues.apache.org/jira/browse/MAHOUT-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702458#comment-13702458 ] Grant Ingersoll commented on MAHOUT-1275: - [~sslavic] Please revert this. We are under code freeze right now on trunk. Drop some of the Release Artifact File Types Key: MAHOUT-1275 URL: https://issues.apache.org/jira/browse/MAHOUT-1275 Project: Mahout Issue Type: Task Reporter: Grant Ingersoll Assignee: Stevo Slavic Priority: Minor Fix For: 0.9 There really is no reason why we need so many release artifacts for the distribution. We run on *NIX machines. Zip and Gzip are standard tools, let's save a few bits, along with Release Manager upload times, and drop the BZ2 format.
Re: (Bi-)Weekly/Monthly Dev Sessions
Hmm, seems like that old link doesn't work. Here's a new one: https://plus.google.com/hangouts/_/899b63ca1b3864c749886348cdddfcd80d00bb0b?hl=en -Grant On Jul 7, 2013, at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote: How about tomorrow (Monday) night at 8:30 pm EDT? Anyone who wants to join, can browse to https://plus.google.com/hangouts/_/1aa32da8d1f9b1669cf6b5ec8bce123d12aec409?hl=en If for some reason that doesn't work, ping me on IRC (gsingers) in the #mahout channel on Freenode. Agenda: 0.8 Release Testing -Grant On Jun 25, 2013, at 6:17 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Is today's Hangout happening? On Wed, Jun 12, 2013 at 4:26 AM, Grant Ingersoll gsing...@apache.org wrote: Hi, One of the things we kicked around at Buzzwords was having a weekly/bi-weekly/monthly dev session via Google hangout (Drill does this with good success, I believe). Since we are so spread out, I thought I would throw out a Doodle (scheduling tool for those unfamiliar) to see what times work best for the majority of people interested in such a thing. Anyone is free to participate, but this is not a Q and A session, but is instead focused on writing code, fixing bugs, triaging JIRA, releasing, etc. If you are interested, please fill out http://doodle.com/gatxxkm7f25fq5y8 (note, all times are Eastern Time Zone since I did the poll!) I just grabbed a sampling of hours throughout the day. I also picked 1 week as being representative of this being on a repeating schedule. If none of the times work for you, but you are still interested, please respond here. I would imagine we would meet for 1-2 hours. Also, please reply with the frequency at which you would like to meet: [] Weekly [] Bi-weekly (every 2 weeks) [] Monthly My vote is every two weeks. -Grant -- Thanks, Pradeep Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: Jenkins build is back to normal : Mahout-Quality #2128
On Jul 6, 2013, at 4:38 PM, Stevo Slavić ssla...@gmail.com wrote: What did the trick (as of r1500216) for last two builds to be successful was serializing unit tests. At least some of them it seems are not designed to run in parallel (they very likely share some state), and they were running in parallel (1.5 per CPU core of Jenkins node on which build is running), causing each other to fail randomly. Now it's all sequential. So, we undid the parallel builds? Do you have a sense of the ones that were causing problems? -G
Re: Code Freeze for 0.8
Working on the release now. If anyone wants to join in, I'm on IRC as well. -Grant On Jul 5, 2013, at 12:40 PM, Sebastian Schelter s...@apache.org wrote: +1 On 05.07.2013 18:06, Jake Mannix wrote: +1 On Fri, Jul 5, 2013 at 8:47 AM, Ted Dunning ted.dunn...@gmail.com wrote: +1 On Fri, Jul 5, 2013 at 7:43 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: +1 From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org dev@mahout.apache.org Sent: Friday, July 5, 2013 10:36 AM Subject: Code Freeze for 0.8 I know it's short notice, but I'd like to suggest a code freeze for 0.8 today or tomorrow and I will do a 0.8 RC on Sunday. Based on JIRA, etc., it looks like this should be fine, but let me know if there are any objections. Thanks, Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: (Bi-)Weekly/Monthly Dev Sessions
How about tomorrow (Monday) night at 8:30 pm EDT? Anyone who wants to join, can browse to https://plus.google.com/hangouts/_/1aa32da8d1f9b1669cf6b5ec8bce123d12aec409?hl=en If for some reason that doesn't work, ping me on IRC (gsingers) in the #mahout channel on Freenode. Agenda: 0.8 Release Testing -Grant On Jun 25, 2013, at 6:17 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Is today's Hangout happening? On Wed, Jun 12, 2013 at 4:26 AM, Grant Ingersoll gsing...@apache.org wrote: Hi, One of the things we kicked around at Buzzwords was having a weekly/bi-weekly/monthly dev session via Google hangout (Drill does this with good success, I believe). Since we are so spread out, I thought I would throw out a Doodle (scheduling tool for those unfamiliar) to see what times work best for the majority of people interested in such a thing. Anyone is free to participate, but this is not a Q and A session, but is instead focused on writing code, fixing bugs, triaging JIRA, releasing, etc. If you are interested, please fill out http://doodle.com/gatxxkm7f25fq5y8 (note, all times are Eastern Time Zone since I did the poll!) I just grabbed a sampling of hours throughout the day. I also picked 1 week as being representative of this being on a repeating schedule. If none of the times work for you, but you are still interested, please respond here. I would imagine we would meet for 1-2 hours. Also, please reply with the frequency at which you would like to meet: [] Weekly [] Bi-weekly (every 2 weeks) [] Monthly My vote is every two weeks. -Grant -- Thanks, Pradeep Grant Ingersoll | @gsingers http://www.lucidworks.com
0.8 Release Notes
Please add/edit/delete/extend: https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8 Artifacts for the RC are uploading as I type. -Grant
Code Freeze for 0.8
I know it's short notice, but I'd like to suggest a code freeze for 0.8 today or tomorrow and I will do a 0.8 RC on Sunday. Based on JIRA, etc., it looks like this should be fine, but let me know if there are any objections. Thanks, Grant
Re: In-Mapper combiner design pattern
Just coming back to this... On Jun 12, 2013, at 5:38 PM, DB Tsai dbt...@dbtsai.com wrote: Hi, For scalable SVM, since our codebase is quite different from Mahout, it may take some time to refactor it to work in Mahout. Note, the community may be able to help here: if you put up a patch, others will likely jump on and help. Your call, of course. Food for thought, Grant
Re: 0.8 progress
I should have time next week to do the release, if we can get these knocked out. If not next week, the following. On Jun 28, 2013, at 5:46 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: 1. Could someone look at Mahout-1257? There is a patch that's been submitted but I am not sure if this has been superseded by Sean's against Mahout-1239. 2. Stevo, I am for fixing the findbugs excludes as part of 0.8 release, I see that the number of warnings has gone up over the last few builds. 3. I am more concerned about the cause of the mysterious cosmic rays that randomly fail unit tests (since we have moved to running parallel tests). I see that happening on my local repository too. From: Stevo Slavić ssla...@gmail.com To: dev@mahout.apache.org Sent: Friday, June 28, 2013 3:21 AM Subject: Re: 0.8 progress Well done team! Build is unstable, oscillates, IMO regardless of changes made. Judging from logs I suspect that some of the Jenkins nodes are not configured well, /tmp directory security related issues, and file size constraints. Could be also issue with our tests. Javadoc was reported earlier not to be OK (not all modules in aggregated javadoc), and code quality reports are not working OK, e.g. findbugs doesn't respect excludes - plan to work on this during weekend. Do we want to fix these before or after 0.8 release? Kind regards, Stevo Slavić. On Fri, Jun 28, 2013 at 12:32 AM, Robin Anil robin.a...@gmail.com wrote: All Done Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Sun, Jun 23, 2013 at 11:36 PM, Robin Anil robin.a...@gmail.com wrote: I sent the comments. The code is good. But without the matrix/vector input we cant ship it in the release. Hope Yiqun and Da Zhang can make those changes quickly. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Sun, Jun 23, 2013 at 8:46 PM, Grant Ingersoll gsing...@apache.org wrote: I see 1 issue left: MAHOUT-1214. It is assigned to Robin. Any chance we can finish this up this week? 
-Grant On Jun 23, 2013, at 9:26 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Finally got to finishing up M-833, the changes can be reviewed at https://reviews.apache.org/r/11774/diff/3/. From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org Sent: Tuesday, June 11, 2013 10:09 AM Subject: Re: 0.8 progress I pushed M-1030 and M-1233. If we can get M-833 and M-1214 in by Thursday, I can roll an RC on Thursday. -Grant On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote: Down to 4 issues! I would say what they are, but JIRA is flaking out again. My instinct is that 1030 and 1233 can be pushed. Suneel has been working hard to get M-833 in. Not sure on M-1214, Robin? -G On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote: On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote: M-1067 -- Dmitriy -- This is an enhancement, should we push? Looks like this was committed already. Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: 0.8 progress
We really should setup a VM that we can run a couple of nodes (perhaps at ASF?) on that we can share w/ everyone that makes it easy to test our stuff on Hadoop for the specific version that we ship. On Jun 28, 2013, at 2:41 PM, Robin Anil robin.a...@gmail.com wrote: Can someone (if you have time and experience). Write a small shim to run all examples one after the other on a cluster and write up instructions on how to do it.? Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Fri, Jun 28, 2013 at 1:11 PM, Sebastian Schelter s...@apache.org wrote: Its crucial that we retest everything on a real cluster before the release. I will do this for the recommenders code next week. --sebastian Am 28.06.2013 14:03 schrieb Grant Ingersoll gsing...@apache.org: I should have time next week to do the release, if we can get these knocked out. If not next week, the following. On Jun 28, 2013, at 5:46 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: 1. Could someone look at Mahout-1257? There is a patch that's been submitted but I am not sure if this has been superseded by Sean's against Mahout-1239. 2. Stevo, I am for fixing the findbugs excludes as part of 0.8 release, I see that the number of warnings has gone up over the last few builds. 3. I am more concerned about the cause of the mysterious cosmic rays that randomly fail unit tests (since we have moved to running parallel tests). I see that happening on my local repository too. From: Stevo Slavić ssla...@gmail.com To: dev@mahout.apache.org Sent: Friday, June 28, 2013 3:21 AM Subject: Re: 0.8 progress Well done team! Build is unstable, oscillates, IMO regardless of changes made. Judging from logs I suspect that some of the Jenkins nodes are not configured well, /tmp directory security related issues, and file size constraints. Could be also issue with our tests. Javadoc was reported earlier not to be OK (not all modules in aggregated javadoc), and code quality reports are not working OK, e.g. 
findbugs doesn't respect excludes - plan to work on this during weekend. Do we want to fix these before or after 0.8 release? Kind regards, Stevo Slavić. On Fri, Jun 28, 2013 at 12:32 AM, Robin Anil robin.a...@gmail.com wrote: All Done Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Sun, Jun 23, 2013 at 11:36 PM, Robin Anil robin.a...@gmail.com wrote: I sent the comments. The code is good. But without the matrix/vector input we cant ship it in the release. Hope Yiqun and Da Zhang can make those changes quickly. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Sun, Jun 23, 2013 at 8:46 PM, Grant Ingersoll gsing...@apache.org wrote: I see 1 issue left: MAHOUT-1214. It is assigned to Robin. Any chance we can finish this up this week? -Grant On Jun 23, 2013, at 9:26 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Finally got to finishing up M-833, the changes can be reviewed at https://reviews.apache.org/r/11774/diff/3/. From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org Sent: Tuesday, June 11, 2013 10:09 AM Subject: Re: 0.8 progress I pushed M-1030 and M-1233. If we can get M-833 and M-1214 in by Thursday, I can roll an RC on Thursday. -Grant On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote: Down to 4 issues! I would say what they are, but JIRA is flaking out again. My instinct is that 1030 and 1233 can be pushed. Suneel has been working hard to get M-833 in. Not sure on M-1214, Robin? -G On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote: On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote: M-1067 -- Dmitriy -- This is an enhancement, should we push? Looks like this was committed already. Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691954#comment-13691954 ] Grant Ingersoll commented on MAHOUT-1214: - Hi, Any progress on this? It is the last open issue for 0.8. Thanks, Grant Improve the accuracy of the Spectral KMeans Method -- Key: MAHOUT-1214 URL: https://issues.apache.org/jira/browse/MAHOUT-1214 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.7 Environment: Mahout 0.7 Reporter: Yiqun Hu Assignee: Robin Anil Labels: clustering, improvement Fix For: 0.8 Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2 The current implementation of the spectral KMeans algorithm (Andrew Ng et al., NIPS 2002) in version 0.7 has two serious issues. These two flaws make it fail even on a trivial dataset. We have implemented a solution to resolve these two issues and hope to contribute it back to the community. # Issue 1: The EigenVerificationJob in version 0.7 does not check the orthogonality of eigenvectors, which is necessary to obtain the correct clustering results for the case of K > 1; we have an idea and an implementation that select eigenvectors based on cosAngle/orthogonality. # Issue 2: The random seed initialization of the KMeans algorithm is not optimal, and a bad initialization will sometimes generate a wrong clustering result. In this case, the selected K eigenvectors actually provide a better way to initialize the cluster centroids, because each selected eigenvector is a relaxed indicator of the memberships of one cluster. For every selected eigenvector, we use the data point whose eigen component achieves the maximum absolute value. We have already verified our improvement on a synthetic dataset, and it shows that the improved version gets the optimal clustering result while the current 0.7 version obtains the wrong result. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
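The two ideas in the MAHOUT-1214 description (an orthogonality check based on the cosine of the angle between eigenvectors, and seeding each centroid from the data point with the largest absolute component in an eigenvector) can be illustrated with a small, self-contained sketch. This is not the code from the attached patch and does not use Mahout's vector classes: the class name `SpectralSeedSketch`, its method names, and the plain `double[][]` layout (one row per selected eigenvector, one column per data point) are assumptions made only for illustration, and the eigenvectors are assumed to be non-zero and of equal length.

```java
import java.util.Arrays;

// Illustrative sketch only; names and data layout are hypothetical, not Mahout's API.
final class SpectralSeedSketch {

    // Cosine of the angle between two vectors of equal length (assumed non-zero).
    static double cosAngle(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    // True when every pair of vectors has |cos(angle)| within the tolerance of zero.
    static boolean mutuallyOrthogonal(double[][] vectors, double tolerance) {
        for (int i = 0; i < vectors.length; i++) {
            for (int j = i + 1; j < vectors.length; j++) {
                if (Math.abs(cosAngle(vectors[i], vectors[j])) > tolerance) {
                    return false;
                }
            }
        }
        return true;
    }

    // For each eigenvector, return the index of the data point whose component
    // has the largest absolute value; that point seeds one cluster centroid.
    static int[] seedIndices(double[][] eigenvectors) {
        int[] seeds = new int[eigenvectors.length];
        for (int k = 0; k < eigenvectors.length; k++) {
            int best = 0;
            for (int i = 1; i < eigenvectors[k].length; i++) {
                if (Math.abs(eigenvectors[k][i]) > Math.abs(eigenvectors[k][best])) {
                    best = i;
                }
            }
            seeds[k] = best;
        }
        return seeds;
    }

    public static void main(String[] args) {
        // Two toy "eigenvectors" over four data points.
        double[][] eigenvectors = {
            {0.9, 0.1, 0.0, 0.0},
            {0.0, 0.0, -0.8, 0.6}
        };
        System.out.println("orthogonal: " + mutuallyOrthogonal(eigenvectors, 1e-9));
        System.out.println("seed points: " + Arrays.toString(seedIndices(eigenvectors)));
    }
}
```

In a real job the check and the seed selection would run over distributed vectors rather than in-memory arrays; the sketch only shows the arithmetic that the two issues describe.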
Re: (Bi-)Weekly/Monthly Dev Sessions
I'd really like to, but had a trip come up. If possible, can we push for one week? Otherwise, if others want to go forward, I can try to set things up and share it w/ others. On Jun 24, 2013, at 6:35 PM, Bhaskar Mookerji mooke...@spin-one.org wrote: Hi! Is the Google hangouts dev session tomorrow/Tuesday still happening? Lurkingly, Buro Mookerji On Fri, Jun 14, 2013 at 3:37 AM, Grant Ingersoll gsing...@apache.orgwrote: It seems to be that 6 pm ET is the consensus time for the majority of people, although my having screwed up the poll didn't help. Bi-weekly is the other consensus. It also looks like Tuesday or Thursday are the preferred dates. I can't make next week, so I'm going to propose we kick off on Tuesday, June 25 at 6 pm. That will give us time to dry-run the Google Hangouts, etc. Again, just to be clear, the goal here is to work on the development of Mahout, not to answer questions about how to run Mahout (we could do that separately if there is a desire.) I'll send out a reminder as we get closer. -Grant On Jun 12, 2013, at 3:04 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: I am from Northern Virginia, how many of us here are from the Washington DC Metro area? From: Jake Mannix jake.man...@gmail.com To: dev@mahout.apache.org dev@mahout.apache.org Sent: Wednesday, June 12, 2013 1:56 PM Subject: Re: (Bi-)Weekly/Monthly Dev Sessions Wow, a lot of Seattleites, I should organize a Mahout MeetUp / Hackathon when I get back from europe at the end of the summer! On Wed, Jun 12, 2013 at 10:44 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Bi-weekly is good for me; I'm in Seattle and just filled out the poll. Great idea! On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com wrote: +1, am in Seattle as well and would love to attend and be involved. Sent from my iPhone On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com wrote: Good idea on recurring meetings. Im very interested in participating. Biweekly works for me. 
I'm in Seattle (pacific) timezone - GMT-8. An agenda for the meetings ahead of time will help us get the most of our time at the meetings. Thanks. On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org wrote: On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu wrote: Angel and Suneel, you may want to re-fill out the new doodle. FYI, this week won't be representative of my schedule; I'm in the last few weeks of a job at ORNL where I travel every weekend. Normally I'll have more flexibility than just 6pm on weeknights. Yeah, Doodle makes you pick dates, but I just want it to be representative a week long period of time and not tied to a specific set of dates. So, just put in what your ideal times are in general and ignore the fact that it is set to next week. On 6/12/13 8:26 AM, Grant Ingersoll wrote: On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote: +1, awesome idea One question: the poll, while set to GMT -5, does say it's in Central Time. Is this a daylight savings thing? I turned on Time Zone support, so not sure how it will look to others, but it sounds like it adjusts based on your location... I see: 8 am, 10, 1, so on. I also realize, that I messed it up. I meant 9 pm, not 9 am. Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv Grant Ingersoll | @gsingers http://www.lucidworks.com -- -jake Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: Build failed in Jenkins: mahout-nightly » Mahout Integration #1272
Can someone w/ more Hadoop experience look at this? We are getting: java.lang.ClassCastException: org.apache.mahout.text.LuceneSegmentInputSplit cannot be cast to org.apache.hadoop.mapred.InputSplit at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214) AFAICT, we are using the new APIs, but this seems to think it should be the old APIs. Note, this is an intermittent issue. Sometimes it goes through just fine. Locally, it passes for me. Note, this could also be related to the Parallel tests stuff. -Grant On Jun 24, 2013, at 7:06 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.611 sec FAILURE! testSequential(org.apache.mahout.text.SequenceFilesFromMailArchivesTest) Time elapsed: 1.268 sec FAILURE! org.junit.ComparisonFailure: expected:<TEST/subdir/[mail-messages].gz/u...@example.com> but was:<TEST/subdir/[subsubdir/mail-messages-2].gz/u...@example.com> at org.junit.Assert.assertEquals(Assert.java:115) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.mahout.text.SequenceFilesFromMailArchivesTest.testSequential(SequenceFilesFromMailArchivesTest.java:108) Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: Build failed in Jenkins: mahout-nightly » Mahout Integration #1272
Never mind the noise here, I misread this! Still, we have some error going on w/ random failures. On Jun 24, 2013, at 8:33 PM, Grant Ingersoll gsing...@apache.org wrote: Can someone w/ more Hadoop experience look at this? We are getting: java.lang.ClassCastException: org.apache.mahout.text.LuceneSegmentInputSplit cannot be cast to org.apache.hadoop.mapred.InputSplit at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214) AFAICT, we are using the new APIs, but this seems to think it should be the old APIs. Note, this is an intermittent issue. Sometimes it goes through just fine. Locally, it passes for me. Note, this could also be related to the Parallel tests stuff. -Grant On Jun 24, 2013, at 7:06 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.611 sec FAILURE! testSequential(org.apache.mahout.text.SequenceFilesFromMailArchivesTest) Time elapsed: 1.268 sec FAILURE! org.junit.ComparisonFailure: expected:TEST/subdir/[mail-messages].gz/u...@example.com but was:TEST/subdir/[subsubdir/mail-messages-2].gz/u...@example.com at org.junit.Assert.assertEquals(Assert.java:115) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.mahout.text.SequenceFilesFromMailArchivesTest.testSequential(SequenceFilesFromMailArchivesTest.java:108) Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: 0.8 progress
I see 1 issue left: MAHOUT-1214. It is assigned to Robin. Any chance we can finish this up this week? -Grant On Jun 23, 2013, at 9:26 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Finally got to finishing up M-833, the changes can be reviewed at https://reviews.apache.org/r/11774/diff/3/. From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org Sent: Tuesday, June 11, 2013 10:09 AM Subject: Re: 0.8 progress I pushed M-1030 and M-1233. If we can get M-833 and M-1214 in by Thursday, I can roll an RC on Thursday. -Grant On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote: Down to 4 issues! I would say what they are, but JIRA is flaking out again. My instinct is that 1030 and 1233 can be pushed. Suneel has been working hard to get M-833 in. Not sure on M-1214, Robin? -G On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote: On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote: M-1067 -- Dmitriy -- This is an enhancement, should we push? Looks like this was committed already. Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: 0.8 progress
How's progress? On Jun 12, 2013, at 8:46 PM, Grant Ingersoll gsing...@apache.org wrote: Fine by me. On Jun 12, 2013, at 6:12 PM, Robin Anil robin.a...@gmail.com wrote: +1 for monday. I would like this time to test MIA clustering code for the new version. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Wed, Jun 12, 2013 at 4:13 PM, Suneel Marthi suneel_mar...@yahoo.comwrote: I am in the same boat as Dan in finishing up M-833, just not finding the time. I should have time on the weekend to wrap this up. Grant, could we have the release on Monday? From: Dan Filimon dangeorge.fili...@gmail.com To: Mahout-Dev dev@mahout.apache.org Sent: Wednesday, June 12, 2013 5:09 PM Subject: Re: 0.8 progress It turns out that my initial estimate of the time it takes to finish these issues was overly optimistic. I'm squashed between work and writing my thesis and unforeseen merging issues. So, I hate to say this, but could we please postpone this release till Monday? On Wed, Jun 12, 2013 at 1:11 PM, Grant Ingersoll gsing...@apache.org wrote: Sounds good. On Jun 11, 2013, at 4:36 PM, Dan Filimon dangeorge.fili...@gmail.com wrote: Sorry to rain on everyone's party, but I opened a few more issues I need to take of before 0.8 final that I had forgotten about. M-1253 to M-1256. I have code for all of these (that I tested, incidentally, that's the code I used for the experiments in the talk :), just need to merge it in and I wanted to have issues to mark as done to keep track of things. Should not take long and I should be done by Thursday. Also, would anyone like to review the code on ReviewBoard? :) On Tue, Jun 11, 2013 at 5:09 PM, Grant Ingersoll gsing...@apache.org wrote: I pushed M-1030 and M-1233. If we can get M-833 and M-1214 in by Thursday, I can roll an RC on Thursday. -Grant On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote: Down to 4 issues! I would say what they are, but JIRA is flaking out again. 
My instinct is that 1030 and 1233 can be pushed. Suneel has been working hard to get M-833 in. Not sure on M-1214, Robin? -G On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote: On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote: M-1067 -- Dmitriy -- This is an enhancement, should we push? Looks like this was committed already. Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: (Bi-)Weekly/Monthly Dev Sessions
It seems to be that 6 pm ET is the consensus time for the majority of people, although my having screwed up the poll didn't help. Bi-weekly is the other consensus. It also looks like Tuesday or Thursday are the preferred dates. I can't make next week, so I'm going to propose we kick off on Tuesday, June 25 at 6 pm. That will give us time to dry-run the Google Hangouts, etc. Again, just to be clear, the goal here is to work on the development of Mahout, not to answer questions about how to run Mahout (we could do that separately if there is a desire.) I'll send out a reminder as we get closer. -Grant On Jun 12, 2013, at 3:04 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: I am from Northern Virginia, how many of us here are from the Washington DC Metro area? From: Jake Mannix jake.man...@gmail.com To: dev@mahout.apache.org dev@mahout.apache.org Sent: Wednesday, June 12, 2013 1:56 PM Subject: Re: (Bi-)Weekly/Monthly Dev Sessions Wow, a lot of Seattleites, I should organize a Mahout MeetUp / Hackathon when I get back from europe at the end of the summer! On Wed, Jun 12, 2013 at 10:44 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Bi-weekly is good for me; I'm in Seattle and just filled out the poll. Great idea! On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com wrote: +1, am in Seattle as well and would love to attend and be involved. Sent from my iPhone On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com wrote: Good idea on recurring meetings. Im very interested in participating. Biweekly works for me. I'm in Seattle (pacific) timezone - GMT-8. An agenda for the meetings ahead of time will help us get the most of our time at the meetings. Thanks. On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org wrote: On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu wrote: Angel and Suneel, you may want to re-fill out the new doodle. 
FYI, this week won't be representative of my schedule; I'm in the last few weeks of a job at ORNL where I travel every weekend. Normally I'll have more flexibility than just 6pm on weeknights. Yeah, Doodle makes you pick dates, but I just want it to be representative a week long period of time and not tied to a specific set of dates. So, just put in what your ideal times are in general and ignore the fact that it is set to next week. On 6/12/13 8:26 AM, Grant Ingersoll wrote: On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote: +1, awesome idea One question: the poll, while set to GMT -5, does say it's in Central Time. Is this a daylight savings thing? I turned on Time Zone support, so not sure how it will look to others, but it sounds like it adjusts based on your location... I see: 8 am, 10, 1, so on. I also realize, that I messed it up. I meant 9 pm, not 9 am. Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv Grant Ingersoll | @gsingers http://www.lucidworks.com -- -jake Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682108#comment-13682108 ] Grant Ingersoll commented on MAHOUT-1214: - bq. But @Grant suggest we supply the patch of v0.7 first. Yes, I was working under the assumption that an old patch is better than no patch. A patch against HEAD is even better. I think we have a few more days, so against HEAD would be great. Improve the accuracy of the Spectral KMeans Method -- Key: MAHOUT-1214 URL: https://issues.apache.org/jira/browse/MAHOUT-1214 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.7 Environment: Mahout 0.7 Reporter: Yiqun Hu Assignee: Robin Anil Labels: clustering, improvement Fix For: 0.8 Attachments: MAHOUT-1214.patch, matrix_1, matrix_2 The current implementation of the spectral KMeans algorithm (Andrew Ng et al., NIPS 2002) in version 0.7 has two serious issues. These two flaws make it fail even on a trivial dataset. We have implemented a solution to resolve these two issues and hope to contribute it back to the community. # Issue 1: The EigenVerificationJob in version 0.7 does not check the orthogonality of eigenvectors, which is necessary to obtain the correct clustering results for the case of K > 1; we have an idea and an implementation that select eigenvectors based on cosAngle/orthogonality. # Issue 2: The random seed initialization of the KMeans algorithm is not optimal, and a bad initialization will sometimes generate a wrong clustering result. In this case, the selected K eigenvectors actually provide a better way to initialize the cluster centroids, because each selected eigenvector is a relaxed indicator of the memberships of one cluster. For every selected eigenvector, we use the data point whose eigen component achieves the maximum absolute value.
We have already verified our improvement on a synthetic dataset, and it shows that the improved version gets the optimal clustering result while the current 0.7 version obtains the wrong result. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility
[ https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13682325#comment-13682325 ] Grant Ingersoll commented on MAHOUT-944: [~smarthi], the error only seems to happen when running all the tests and it seems to be intermittent. It almost looks like some type of classpath issue. LuceneIndexToSequenceFiles (lucene2seq) utility --- Key: MAHOUT-944 URL: https://issues.apache.org/jira/browse/MAHOUT-944 Project: Mahout Issue Type: New Feature Components: Integration Affects Versions: 0.5 Reporter: Frank Scholten Assignee: Grant Ingersoll Priority: Minor Fix For: 0.8 Attachments: MAHOUT-944-minor.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch Here is a lucene2seq tool I used in a project. It creates sequence files based on the stored fields of a lucene index. The output from this tool can be then fed into seq2sparse and from there you can do text clustering. Comes with Java bean configuration. Let me know what you think. Some CLI code can be added later on. I used this for a small-scale project +- 100.000 docs. Is a MR version useful or is that overkill? See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and review comments from Simon Willnauer (Thanks Simon!) or the attached patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: 0.8 progress
Sounds good. On Jun 11, 2013, at 4:36 PM, Dan Filimon dangeorge.fili...@gmail.com wrote: Sorry to rain on everyone's party, but I opened a few more issues I need to take care of before 0.8 final that I had forgotten about: M-1253 to M-1256. I have code for all of these (that I tested; incidentally, that's the code I used for the experiments in the talk :), I just need to merge it in, and I wanted to have issues to mark as done to keep track of things. It should not take long and I should be done by Thursday. Also, would anyone like to review the code on ReviewBoard? :) On Tue, Jun 11, 2013 at 5:09 PM, Grant Ingersoll gsing...@apache.org wrote: I pushed M-1030 and M-1233. If we can get M-833 and M-1214 in by Thursday, I can roll an RC on Thursday. -Grant On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote: Down to 4 issues! I would say what they are, but JIRA is flaking out again. My instinct is that 1030 and 1233 can be pushed. Suneel has been working hard to get M-833 in. Not sure on M-1214, Robin? -G On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote: On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote: M-1067 -- Dmitriy -- This is an enhancement, should we push? Looks like this was committed already. Grant Ingersoll | @gsingers http://www.lucidworks.com
(Bi-)Weekly/Monthly Dev Sessions
Hi, One of the things we kicked around at Buzzwords was having a weekly/bi-weekly/monthly dev session via Google hangout (Drill does this with good success, I believe). Since we are so spread out, I thought I would throw out a Doodle (scheduling tool for those unfamiliar) to see what times work best for the majority of people interested in such a thing. Anyone is free to participate, but this is not a Q and A session, but is instead focused on writing code, fixing bugs, triaging JIRA, releasing, etc. If you are interested, please fill out http://doodle.com/gatxxkm7f25fq5y8 (note, all times are Eastern Time Zone since I did the poll!) I just grabbed a sampling of hours throughout the day. I also picked 1 week as being representative of this being on a repeating schedule. If none of the times work for you, but you are still interested, please respond here. I would imagine we would meet for 1-2 hours. Also, please reply with the frequency at which you would like to meet: [] Weekly [] Bi-weekly (every 2 weeks) [] Monthly My vote is every two weeks. -Grant
Re: (Bi-)Weekly/Monthly Dev Sessions
On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote: +1, awesome idea One question: the poll, while set to GMT -5, does say it's in Central Time. Is this a daylight savings thing? I turned on Time Zone support, so not sure how it will look to others, but it sounds like it adjusts based on your location... I see: 8 am, 10, 1, so on. I also realize, that I messed it up. I meant 9 pm, not 9 am. Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv
[jira] [Commented] (MAHOUT-833) Make conversion to sequence files map-reduce
[ https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681206#comment-13681206 ] Grant Ingersoll commented on MAHOUT-833: The patch seems to be missing the WholeFileRecordReader. Make conversion to sequence files map-reduce Key: MAHOUT-833 URL: https://issues.apache.org/jira/browse/MAHOUT-833 Project: Mahout Issue Type: Improvement Components: Integration Affects Versions: 0.7 Reporter: Grant Ingersoll Assignee: Suneel Marthi Labels: MAHOUT_INTRO_CONTRIBUTE Fix For: 0.8 Attachments: MAHOUT-833-final.patch, MAHOUT-833.patch, MAHOUT-833.patch Given input that is on HDFS, the SequenceFilesFrom.java classes should be able to do their work in parallel. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: (Bi-)Weekly/Monthly Dev Sessions
On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu wrote: Angel and Suneel, you may want to re-fill out the new doodle. FYI, this week won't be representative of my schedule; I'm in the last few weeks of a job at ORNL where I travel every weekend. Normally I'll have more flexibility than just 6pm on weeknights. Yeah, Doodle makes you pick dates, but I just want it to be representative a week long period of time and not tied to a specific set of dates. So, just put in what your ideal times are in general and ignore the fact that it is set to next week. On 6/12/13 8:26 AM, Grant Ingersoll wrote: On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote: +1, awesome idea One question: the poll, while set to GMT -5, does say it's in Central Time. Is this a daylight savings thing? I turned on Time Zone support, so not sure how it will look to others, but it sounds like it adjusts based on your location... I see: 8 am, 10, 1, so on. I also realize, that I messed it up. I meant 9 pm, not 9 am. Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: In-Mapper combiner design pattern
Hi DB, This all sounds rather interesting. I see a number of places where we use combiners, so perhaps focus on those first? Also, any thoughts on when the scalable SVM would be ready? We are trying to get 1.0 out in the next few months and I personally think it would be good to have SVM in. -Grant On Jun 11, 2013, at 8:20 PM, DB Tsai dbt...@dbtsai.com wrote: Hi, Recently we started to use the in-mapper combiner design pattern in our Hadoop-based algorithms at Alpine Data Labs; those algorithms include variable selection using info gain, decision tree, naive Bayes model and SVM, and we found that we can get a 20~40% speedup without doing too much work. The whole idea is really simple: use an in-mapper LRU cache to combine the results first, instead of using the combiner directly. If the cache is full, just emit the result to the combiner or reducer. The details are discussed in Data-Intensive Text Processing with MapReduce (http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf) by Jimmy Lin and Chris Dyer at the University of Maryland, College Park. We would like to contribute the API to Mahout and work more closely with the open source community. I'm now working on random forest using information gain, and we plan to contribute it to the Mahout community. We also have a scalable kernel SVM implementation which we intend to contribute to Mahout as well. We just presented a talk about our SVM at the SF machine learning meetup, with great feedback; see http://www.meetup.com/sfmachinelearning/events/116497192/?_af_eid=116497192a=uc1_te_af=event The API is pretty simple: just change context.write to combiner.write, and remember to flush the cache in the cleanup method. This is an example of implementing the classic Hadoop word count using the in-mapper combiner: https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerExampleTest.java
The test code is in https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java The actual implementation of the in-mapper combiner using an LRU cache is at https://github.com/dbtsai/mahout/blob/trunk/core/src/main/java/org/apache/mahout/common/mapreduce/InMapperCombiner.java and it is covered by that test. I'm wondering which part of Mahout is the best candidate for this kind of in-mapper combiner, to demonstrate that the idea works; I'll focus on that particular use case and benchmark it. Thanks. Sincerely, DB Tsai --- Web: http://www.dbtsai.com Phone : +1-650-383-8392 Grant Ingersoll | @gsingers http://www.lucidworks.com
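For readers who have not seen the pattern DB describes, here is a minimal, self-contained sketch of the idea: partial results are combined in an in-memory LRU cache inside the mapper, an entry is emitted downstream when the cache overflows, and the remainder is flushed at cleanup. This is not the `InMapperCombiner` class linked above and it deliberately avoids the Hadoop API so that it runs on its own; the class name `InMapperCombinerSketch`, the constructor parameters, and the choice to evict only the single least recently used entry on overflow are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;

// Illustrative sketch only; hypothetical names, no Hadoop dependency.
final class InMapperCombinerSketch<K, V> {
    private final int capacity;                  // max number of keys held in memory
    private final BinaryOperator<V> combine;     // how two values for one key are merged
    private final BiConsumer<K, V> downstream;   // stands in for context.write(key, value)
    private final LinkedHashMap<K, V> cache;

    InMapperCombinerSketch(int capacity, BinaryOperator<V> combine, BiConsumer<K, V> downstream) {
        this.capacity = capacity;
        this.combine = combine;
        this.downstream = downstream;
        // accessOrder = true: iteration runs from least to most recently used.
        this.cache = new LinkedHashMap<>(16, 0.75f, true);
    }

    // Called in place of context.write(key, value) inside map().
    void write(K key, V value) {
        cache.merge(key, value, combine);
        if (cache.size() > capacity) {
            // Cache is over capacity: emit and drop the least recently used entry.
            Iterator<Map.Entry<K, V>> it = cache.entrySet().iterator();
            Map.Entry<K, V> eldest = it.next();
            downstream.accept(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    // Called from the mapper's cleanup step so that nothing stays in the cache.
    void flush() {
        for (Map.Entry<K, V> entry : cache.entrySet()) {
            downstream.accept(entry.getKey(), entry.getValue());
        }
        cache.clear();
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        InMapperCombinerSketch<String, Integer> combiner =
            new InMapperCombinerSketch<>(2, Integer::sum, (k, v) -> emitted.add(k + "=" + v));
        for (String word : "to be or not to be".split(" ")) {
            combiner.write(word, 1);   // word count: one per token
        }
        combiner.flush();
        System.out.println(emitted);
    }
}
```

Because a key can be evicted and then seen again, the same key may be emitted more than once with partial sums, so a regular combiner or the reducer still has to do the final aggregation. The benefit is fewer records leaving the mapper, which is presumably where the speedup reported above comes from.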
[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility
[ https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13681744#comment-13681744 ] Grant Ingersoll commented on MAHOUT-944: Suneel, weird. I didn't see that before. We are using the new APIs, AFAICT, so not sure what is going on. So tired of the stupidity of the dual Map/Reduce APIs in Hadoop. LuceneIndexToSequenceFiles (lucene2seq) utility --- Key: MAHOUT-944 URL: https://issues.apache.org/jira/browse/MAHOUT-944 Project: Mahout Issue Type: New Feature Components: Integration Affects Versions: 0.5 Reporter: Frank Scholten Assignee: Grant Ingersoll Priority: Minor Fix For: 0.8 Attachments: MAHOUT-944-minor.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch Here is a lucene2seq tool I used in a project. It creates sequence files based on the stored fields of a lucene index. The output from this tool can be then fed into seq2sparse and from there you can do text clustering. Comes with Java bean configuration. Let me know what you think. Some CLI code can be added later on. I used this for a small-scale project +- 100.000 docs. Is a MR version useful or is that overkill? See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and review comments from Simon Willnauer (Thanks Simon!) or the attached patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: 0.8 progress
Fine by me. On Jun 12, 2013, at 6:12 PM, Robin Anil robin.a...@gmail.com wrote: +1 for Monday. I would like this time to test MIA clustering code for the new version. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Wed, Jun 12, 2013 at 4:13 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: I am in the same boat as Dan in finishing up M-833, just not finding the time. I should have time on the weekend to wrap this up. Grant, could we have the release on Monday? From: Dan Filimon dangeorge.fili...@gmail.com To: Mahout-Dev dev@mahout.apache.org Sent: Wednesday, June 12, 2013 5:09 PM Subject: Re: 0.8 progress It turns out that my initial estimate of the time it takes to finish these issues was overly optimistic. I'm squashed between work and writing my thesis and unforeseen merging issues. So, I hate to say this, but could we please postpone this release till Monday? On Wed, Jun 12, 2013 at 1:11 PM, Grant Ingersoll gsing...@apache.org wrote: Sounds good. On Jun 11, 2013, at 4:36 PM, Dan Filimon dangeorge.fili...@gmail.com wrote: Sorry to rain on everyone's party, but I opened a few more issues I need to take care of before 0.8 final that I had forgotten about. M-1253 to M-1256. I have code for all of these (that I tested, incidentally, that's the code I used for the experiments in the talk :), just need to merge it in and I wanted to have issues to mark as done to keep track of things. Should not take long and I should be done by Thursday. Also, would anyone like to review the code on ReviewBoard? :) On Tue, Jun 11, 2013 at 5:09 PM, Grant Ingersoll gsing...@apache.org wrote: I pushed M-1030 and M-1233. If we can get M-833 and M-1214 in by Thursday, I can roll an RC on Thursday. -Grant On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote: Down to 4 issues! I would say what they are, but JIRA is flaking out again. My instinct is that 1030 and 1233 can be pushed. Suneel has been working hard to get M-833 in. Not sure on M-1214, Robin?
-G On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote: On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote: M-1067 -- Dmitriy -- This is an enhancement, should we push? Looks like this was committed already. Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Updated] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
[ https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-1030: Fix Version/s: (was: 0.8) 1.0 I'm going to push this. I know that for 0.9 we are looking at reworking the way we handle vectors and their associated properties (i.e. get rid of NamedVector, etc.) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable Key: MAHOUT-1030 URL: https://issues.apache.org/jira/browse/MAHOUT-1030 Project: Mahout Issue Type: Bug Components: Clustering, Integration Affects Versions: 0.7 Reporter: Jeff Eastman Assignee: Suneel Marthi Fix For: 1.0 Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch Looks like this won't make it into this build. Pretty widespread impact on code and tests and I don't know which properties were implemented in the old version. I will create a JIRA and post my interim results. On 6/8/12 12:21 PM, Jeff Eastman wrote: That's a reversion that evidently got in when the new ClusterClassificationDriver was introduced. It should be a pretty easy fix and I will see if I can make the change before Paritosh cuts the release bits tonight. On 6/7/12 1:00 PM, Pat Ferrel wrote: It appears that in kmeans the clusteredPoints are now written as WeightedVectorWritable where in mahout 0.6 they were WeightedPropertyVectorWritable? This means that the distance from the centroid is no longer stored here? Why? I hope I'm wrong because that is not a welcome change. How is one to order clustered docs by distance from cluster centroid? I'm sure I could calculate the distance but that would mean looking up the centroid for the cluster id given in the above WeightedVectorWritable, which means iterating through all the clusters for each clustered doc. In my case the number of clusters could be fairly large. Am I missing something? -- This message is automatically generated by JIRA. 
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13680392#comment-13680392 ] Grant Ingersoll commented on MAHOUT-1214: - Any update on this for applying against trunk/0.8? Improve the accuracy of the Spectral KMeans Method -- Key: MAHOUT-1214 URL: https://issues.apache.org/jira/browse/MAHOUT-1214 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.7 Environment: Mahout 0.7 Reporter: Yiqun Hu Assignee: Robin Anil Labels: clustering, improvement Fix For: 0.8 Attachments: matrix_1, matrix_2, SpectralKMeans.patch The current implementation of the spectral KMeans algorithm (Andrew Ng et al., NIPS 2002) in version 0.7 has two serious issues. These two incorrect implementations make it fail even for a very obvious trivial dataset. We have implemented a solution to resolve these two issues and hope to contribute back to the community. # Issue 1: The EigenVerificationJob in version 0.7 does not check the orthogonality of eigenvectors, which is necessary to obtain the correct clustering results for the case of K>1; We have an idea and implementation to select based on cosAngle/orthogonality; # Issue 2: The random seed initialization of the KMeans algorithm is not optimal and sometimes a bad initialization will generate a wrong clustering result. In this case, the selected K eigenvectors actually provide a better way to initialize cluster centroids because each selected eigenvector is a relaxed indicator of the memberships of one cluster. For every selected eigenvector, we use the data point whose eigen component achieves the maximum absolute value. We have already verified our improvement on a synthetic dataset and it shows that the improved version gets the optimal clustering result while the current 0.7 version obtains the wrong result.
Re: 0.8 progress
I pushed M-1030 and M-1233. If we can get M-833 and M-1214 in by Thursday, I can roll an RC on Thursday. -Grant On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote: Down to 4 issues! I would say what they are, but JIRA is flaking out again. My instinct is that 1030 and 1233 can be pushed. Suneel has been working hard to get M-833 in. Not sure on M-1214, Robin? -G On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote: On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote: M-1067 -- Dmitriy -- This is an enhancement, should we push? Looks like this was committed already. Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Updated] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
[ https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-1030: Fix Version/s: 0.9 Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable Key: MAHOUT-1030 URL: https://issues.apache.org/jira/browse/MAHOUT-1030 Project: Mahout Issue Type: Bug Components: Clustering, Integration Affects Versions: 0.7 Reporter: Jeff Eastman Assignee: Suneel Marthi Fix For: 1.0, 0.9 Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch Looks like this won't make it into this build. Pretty widespread impact on code and tests and I don't know which properties were implemented in the old version. I will create a JIRA and post my interim results. On 6/8/12 12:21 PM, Jeff Eastman wrote: That's a reversion that evidently got in when the new ClusterClassificationDriver was introduced. It should be a pretty easy fix and I will see if I can make the change before Paritosh cuts the release bits tonight. On 6/7/12 1:00 PM, Pat Ferrel wrote: It appears that in kmeans the clusteredPoints are now written as WeightedVectorWritable where in mahout 0.6 they were WeightedPropertyVectorWritable? This means that the distance from the centroid is no longer stored here? Why? I hope I'm wrong because that is not a welcome change. How is one to order clustered docs by distance from cluster centroid? I'm sure I could calculate the distance but that would mean looking up the centroid for the cluster id given in the above WeightedVectorWritable, which means iterating through all the clusters for each clustered doc. In my case the number of clusters could be fairly large. Am I missing something? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos
[ https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-1233. - Resolution: Incomplete Please reopen if you have a repeatable test case, as I am not sure there is an issue here. Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos - Key: MAHOUT-1233 URL: https://issues.apache.org/jira/browse/MAHOUT-1233 Project: Mahout Issue Type: Question Components: Clustering Affects Versions: 0.7, 0.8 Reporter: yannis ats Assignee: yannis ats Priority: Minor Fix For: 0.8 I am trying to process a dataset and I do it in two ways. First I give it as a single chunk (the whole dataset) and second as many smaller chunks in order to increase the throughput of my machine. The problem is that when I perform the single-chunk computation the results are fine, and by fine I mean that if I have 1000 vectors in the input I get 1000 vector ids with their cluster ids in the output (I have tried canopy, kmeans and fuzzy kmeans). However, when I split the dataset in order to speed up the computations, strange phenomena occur. For instance, if the same dataset that contains 1000 vectors is split into, for example, 10 files, then in the output I obtain more vector ids (e.g. 1100 vector ids with their corresponding cluster ids). The question is, am I doing something wrong in the process? Is there a problem in clusterdump and seqdumper when the input is in many files? While Mahout is performing the computations, I have observed that the screen says it processed the correct number of vectors. Am I missing something? I use as input the transformed to mvc weka vectors. I have tried this in v0.7 and the v0.8 snapshot. Thank you in advance for your time.
Re: Random Errors
That was the whole stack trace, unfortunately. On Jun 10, 2013, at 2:35 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote: Grant, top of the stack trace is not sufficient to tell what was the offending thread. Copy-paste the entire stack, including nested exceptions. The console will also contain a full stack trace information at the moment the test framework detected a thread leak. It should be easy to tell what isn't cleaned up properly. Dawid
[jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix
[ https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679817#comment-13679817 ] Grant Ingersoll commented on MAHOUT-1147: - Jake, are you up to date? I fixed a bunch of things related to cluster-reuters. Also, do you have HADOOP_HOME set? Or MAHOUT_LOCAL? CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix --- Key: MAHOUT-1147 URL: https://issues.apache.org/jira/browse/MAHOUT-1147 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.7 Environment: Eclipse IDE Java code base CVB0Driver Class setModelPaths(Job job, Path modelPath) - method Reporter: Jack Pay Assignee: Jake Mannix Labels: bug, cvb, fix, suggestion Fix For: 0.8 Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch Original Estimate: 24h Remaining Estimate: 24h Problem: When training doc/topic model no paths for the term/topic model found (outputs null). These paths are set using setModelPaths in CVB0Driver. Reason for Problem: Variety of Job instances call this method. The Job is passed to the method instead of the Configuration object given to the Job. The configuration is retrieved from the Job instance itself. I believe that this Configuration instance is a clone of the original. This is a problem as the variable MODEL_PATHS is set on the clone which is then discarded when the given Job is complete. The original Configuration has no MODEL_PATHS String set and therefore returns null. The code stipulates that if it cannot find a model to use a new random matrix. This happens every time as MODEL_PATHS is not set for the Configuration instance used. Solution: Do not pass the Job to the setModels method, but pass the Configuration instance passed into the method which created the Job. i.e. change from: setModelPaths(Job job, Path modelPath) to: setModelPaths(Configuration conf, Path modelPath) And change all calling methods accordingly (obviously).
So far what little testing I have done appears to solve this problem.
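The reporter's hypothesis (a setting written to the Job's copy of the Configuration never reaches the original) can be illustrated with a small self-contained sketch. Conf and FakeJob below are hypothetical stand-ins, not Hadoop's Configuration and Job classes, and the MODEL_PATHS key string is made up; the only behaviour assumed is the one described in the report, namely that the job holds a copy of the configuration it was created from.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins that mimic the copy-on-construct behaviour described in the report.
final class ModelPathsSketch {
  static final String MODEL_PATHS = "sketch.model.paths"; // made-up key name

  static final class Conf {
    private final Map<String, String> props = new HashMap<>();
    Conf() {}
    Conf(Conf other) { props.putAll(other.props); } // copy constructor
    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }
  }

  static final class FakeJob {
    private final Conf conf;
    FakeJob(Conf conf) { this.conf = new Conf(conf); } // the job keeps its own copy
    Conf getConfiguration() { return conf; }
  }

  // Variant reported as problematic: the setting lands on the job's private copy.
  static void setModelPaths(FakeJob job, String modelPath) {
    job.getConfiguration().set(MODEL_PATHS, modelPath);
  }

  // Variant proposed in the issue: write to the caller's configuration directly.
  static void setModelPaths(Conf conf, String modelPath) {
    conf.set(MODEL_PATHS, modelPath);
  }

  // What the original configuration sees after setting the path through the job.
  static String viaJob() {
    Conf conf = new Conf();
    FakeJob job = new FakeJob(conf);
    setModelPaths(job, "/tmp/model-1");
    return conf.get(MODEL_PATHS);
  }

  // What a job created afterwards sees when the path was set on the configuration.
  static String viaConf() {
    Conf conf = new Conf();
    setModelPaths(conf, "/tmp/model-1");
    return new FakeJob(conf).getConfiguration().get(MODEL_PATHS);
  }

  public static void main(String[] args) {
    System.out.println("set via job:  " + viaJob());
    System.out.println("set via conf: " + viaConf());
  }
}
```

In this sketch the first variant leaves the caller's configuration without the key, which matches the "outputs null" symptom in the report; whether that is the actual cause in CVB0Driver is for the patch review to confirm.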
[jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix
[ https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679855#comment-13679855 ] Grant Ingersoll commented on MAHOUT-1147: - Hmm, I tested k-means cluster-reuters.sh last night on Hadoop single node and it worked fine. I added a step to copy the reuters-out up to HDFS. Let me make sure I pushed (see MAHOUT-1247) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix --- Key: MAHOUT-1147 URL: https://issues.apache.org/jira/browse/MAHOUT-1147 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.7 Environment: Eclipse IDE Java code base CVB0Driver Class setModelPaths(Job job, Path modelPath) - method Reporter: Jack Pay Assignee: Jake Mannix Labels: bug, cvb, fix, suggestion Fix For: 0.8 Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch Original Estimate: 24h Remaining Estimate: 24h Problem: When training doc/topic model no paths for the term/topic model found (outputs null). These paths are set using setModelPaths in CVB0Driver. Reason for Problem: Variety of Job instances call this method. The Job is passed to the method instead of the Configuration object given to the Job. The configuration is retrieved from the Job instance itself. I believe that this Configuration instance is a clone of the original. This is a problem as the variable MODEL_PATHS is set on the clone which is then discarded when the given Job is complete. The original Configuration has no MODEL_PATHS String set and therefore returns null. The code stipulates that if it cannot find a model to use a new random matrix. This happens every time as MODEL_PATHS is not set for the Configuration instance used. Solution: Do not pass the Job to the setModels method, but pass the Configuration instance passed into the method which created the Job. i.e. 
change from: setModelPaths(Job job, Path modelPath) to: setModelPaths(Configuration conf, Path modelPath) And change all calling methods accordingly (obviously). So far what little testing I have done appears to solve this problem.
[jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix
[ https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679858#comment-13679858 ] Grant Ingersoll commented on MAHOUT-1147: - Do you see:
{code}
echo "Extracting Reuters"
$MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-out
# Only copy to HDFS when Hadoop is configured and we are not running locally
if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ] ; then
  echo "Copying Reuters data to Hadoop"
  set +e
  $HADOOP dfs -rmr ${WORK_DIR}/reuters-sgm
  $HADOOP dfs -rmr ${WORK_DIR}/reuters-out
  set -e
  $HADOOP dfs -put ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-sgm
  $HADOOP dfs -put ${WORK_DIR}/reuters-out ${WORK_DIR}/reuters-out
fi
{code}
Also, I'm on #mahout on IRC if that helps us resolve this faster. CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix --- Key: MAHOUT-1147 URL: https://issues.apache.org/jira/browse/MAHOUT-1147 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.7 Environment: Eclipse IDE Java code base CVB0Driver Class setModelPaths(Job job, Path modelPath) - method Reporter: Jack Pay Assignee: Jake Mannix Labels: bug, cvb, fix, suggestion Fix For: 0.8 Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch Original Estimate: 24h Remaining Estimate: 24h Problem: When training doc/topic model no paths for the term/topic model found (outputs null). These paths are set using setModelPaths in CVB0Driver. Reason for Problem: Variety of Job instances call this method. The Job is passed to the method instead of the Configuration object given to the Job. The configuration is retrieved from the Job instance itself. I believe that this Configuration instance is a clone of the original. This is a problem as the variable MODEL_PATHS is set on the clone which is then discarded when the given Job is complete. The original Configuration has no MODEL_PATHS String set and therefore returns null. The code stipulates that if it cannot find a model to use a new random matrix.
This happens every time as MODEL_PATHS is not set for the Configuration instance used. Solution: Do not pass the Job to the setModels method, but pass the Configuration instance passed into the method which created the Job. i.e. change from: setModelPaths(Job job, Path modelPath) to: setModelPaths(Configuration conf, Path modelPath) And change all calling methods accordingly (obviously). So far what little testing I have done appears to solve this problem.
Welcome new committers Gokhan Capan and Stevo Slavic
Please join me in congratulating Mahout's newest committers, Gokhan Capan and Stevo Slavic, both of whom have been contributing to Mahout for some time now. Gokhan, Stevo, new committer tradition is to give a brief background on yourself, so you have the floor! Congrats, Grant
[jira] [Created] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop
Grant Ingersoll created MAHOUT-1247: --- Summary: cluster-reuters doesn't work on Hadoop Key: MAHOUT-1247 URL: https://issues.apache.org/jira/browse/MAHOUT-1247 Project: Mahout Issue Type: Bug Reporter: Grant Ingersoll Fix For: 0.8 At least two issues: 1. MAHOUT-992 messed up the Distributed Cache stuff somehow 2. The ExtractReuters data is not being moved to HDFS. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop
[ https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned MAHOUT-1247: --- Assignee: Grant Ingersoll cluster-reuters doesn't work on Hadoop -- Key: MAHOUT-1247 URL: https://issues.apache.org/jira/browse/MAHOUT-1247 Project: Mahout Issue Type: Bug Reporter: Grant Ingersoll Assignee: Grant Ingersoll Fix For: 0.8 At least two issues: 1. MAHOUT-992 messed up the Distributed Cache stuff somehow 2. The ExtractReuters data is not being moved to HDFS. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (MAHOUT-1126) Mac builds won't unjar
[ https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-1126. - Resolution: Fixed I think the filter I put in place should (hopefully) fix this going forward. Mac builds won't unjar -- Key: MAHOUT-1126 URL: https://issues.apache.org/jira/browse/MAHOUT-1126 Project: Mahout Issue Type: Bug Components: build Affects Versions: 0.8 Environment: Builds on the Mac Reporter: Pat Ferrel Assignee: Grant Ingersoll Labels: build Fix For: 0.8 On the Mac you have to remove the licenses in the mahout jar or hadoop can't unjar mahout. The Mac has a case insensitive file system and so can't tell the difference between LICENSE and license. This was fixed at one point https://issues.apache.org/jira/browse/MAHOUT-780 zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar META-INF/license/ zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar META-INF/LICENSE/ Looks like as is mentioned in https://issues.apache.org/jira/browse/MAHOUT-780 mv target/maven-shared-archive-resources/META-INF/LICENSE target/maven-shared-archive-resources/META-INF/LICENSES works too. Can this get a permanent fix? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (MAHOUT-1103) clusterpp is not writing directories for all clusters
[ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-1103. - Resolution: Fixed clusterpp is not writing directories for all clusters - Key: MAHOUT-1103 URL: https://issues.apache.org/jira/browse/MAHOUT-1103 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.8 Reporter: Matt Molek Assignee: Grant Ingersoll Labels: clusterpp Fix For: 0.8 Attachments: MAHOUT-1103.patch, MAHOUT-1103.patch, MAHOUT-1103.patch After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is. I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2. Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory. Here is my command sequence for the k=2 run:
{noformat}
bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom
{noformat}
The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively. Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and calling hashCode() gives: VL-3742464 - -685560454 VL-3742466 - -685560452 Finally, when running with -xm sequential, everything performs as expected.
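To make the partitioner suspicion in MAHOUT-1103 concrete, the sketch below applies the usual hash-partitioning formula, (hash & Integer.MAX_VALUE) % numReducers, to the two hash codes quoted in the report. The class and method names are made up for illustration, the hash values are copied from the report rather than recomputed from a Text, and the reducer count of 2 is an assumption (the report does not say how many reducers were used).

```java
// Illustrative only: applies the hash-partitioning formula to the hashes quoted in MAHOUT-1103.
final class PartitionSketch {
  // Mask off the sign bit, then take the remainder modulo the number of reducers.
  static int partition(int hash, int numReducers) {
    return (hash & Integer.MAX_VALUE) % numReducers;
  }

  public static void main(String[] args) {
    int hashA = -685560454; // reported hashCode for VL-3742464
    int hashB = -685560452; // reported hashCode for VL-3742466
    int reducers = 2;       // assumed reducer count
    System.out.println("VL-3742464 -> partition " + partition(hashA, reducers));
    System.out.println("VL-3742466 -> partition " + partition(hashB, reducers));
  }
}
```

Both reported values are even, so with two reducers they fall into the same partition and the other reducer receives nothing, which would be consistent with the empty part-r-* file described above.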
Re: Random Errors
I get a failure on the one below when running in parallel, but not standalone:
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 10.358 sec FAILURE!
testRun(org.apache.mahout.text.SequenceFilesFromLuceneStorageMRJobTest) Time elapsed: 10.358 sec FAILURE!
java.lang.AssertionError: expected:<2002> but was:<0>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at org.apache.mahout.text.SequenceFilesFromLuceneStorageMRJobTest.testRun(SequenceFilesFromLuceneStorageMRJobTest.java:73)
Interesting thing about this one is the Test class has only a single test and it has no randomization. FWIW, it's also becoming increasingly clear to me that we need some notion of real integration tests that we can run against a Hadoop cluster (or at least a virtual Hadoop cluster). -Grant On Jun 8, 2013, at 9:38 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote: number generators. Where a test depends on a particular sequence, and somewhere an RNG doesn't use the RandomUtils trick, it may have a different state if other tests ran before. I have a different solution for this in randomizedtesting framework (a Random instance cannot be shared from test to test, it will throw an exception if you do share it). This doesn't solve all the possible problems but proved quite effective at catching test dependencies. The surefire parameter just controls what order the *classes* run in AFAICT: http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#runOrder Yeah, I was on the train when I wrote that e-mail. The trick I remembered is in fact inside JUnit 4.11 and onwards -- https://github.com/junit-team/junit/blob/master/doc/ReleaseNotes4.11.md#test-execution-order D. Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Assigned] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls
[ https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned MAHOUT-1211: --- Assignee: Grant Ingersoll (was: Ted Dunning) Replace deprecated Closables.closeQuietly calls --- Key: MAHOUT-1211 URL: https://issues.apache.org/jira/browse/MAHOUT-1211 Project: Mahout Issue Type: Improvement Reporter: Stevo Slavic Assignee: Grant Ingersoll Priority: Minor Fix For: 0.8 Attachments: MAHOUT-1211.patch Deprecated Guava {{Closables.closeQuietly}} API has to be replaced, its usage is a code smell, and that method is scheduled to be removed from Guava 16.0. See [this discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] for more info.
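As a rough illustration of what replacing closeQuietly can look like, the sketch below defines a small helper with an explicit swallowIOException flag, so each call site has to decide whether a failed close is ignored or propagated. This is not the MAHOUT-1211 patch: CloseSketch and closeThrows are names invented here, and the helper is a plain-JDK stand-in written to be self-contained (Guava offers a close method with the same two-argument shape).

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical helper; a plain-JDK stand-in for an explicit "close, optionally swallowing" call.
final class CloseSketch {
  // Close the resource; swallow the IOException only if the caller asked for it.
  static void close(Closeable closeable, boolean swallowIOException) throws IOException {
    if (closeable == null) {
      return;
    }
    try {
      closeable.close();
    } catch (IOException e) {
      if (swallowIOException) {
        // Typically acceptable for readers: the data has already been consumed.
        System.err.println("Ignoring failure while closing: " + e.getMessage());
      } else {
        // Typically wanted for writers: a failed close may mean lost output.
        throw e;
      }
    }
  }

  // Small probe used to show the two behaviours side by side.
  static boolean closeThrows(Closeable closeable, boolean swallowIOException) {
    try {
      close(closeable, swallowIOException);
      return false;
    } catch (IOException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Closeable failing = () -> { throw new IOException("simulated close failure"); };
    System.out.println("reader-style close throws: " + closeThrows(failing, true));
    System.out.println("writer-style close throws: " + closeThrows(failing, false));
  }
}
```

The point of the flag is that silently dropping a close failure becomes a visible, per-call decision instead of the default.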
Re: 0.8 progress
I'm on M-1211 and 1247 (M-992 is related). Will be on IRC for a few hours this morning. -Grant On Jun 9, 2013, at 1:48 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Working on M-833. From: Suneel Marthi suneel_mar...@yahoo.com To: dev@mahout.apache.org dev@mahout.apache.org Sent: Saturday, June 8, 2013 6:09 PM Subject: Re: 0.8 progress I will be looking at M-833 and M-1030 tonight. I can get the initial limited functionality for M-884 as part of 0.8 release by tomorrow. Thanks to Robin for reviewing. From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org Sent: Saturday, June 8, 2013 5:09 PM Subject: Re: 0.8 progress I've got 1103 and 1126 close to done. Should be in by tomorrow. On Jun 8, 2013, at 4:18 PM, Robin Anil robin.a...@gmail.com wrote: Down to 15. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Sat, Jun 8, 2013 at 12:30 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: I am done with M-1026. From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org Sent: Saturday, June 8, 2013 10:42 AM Subject: Re: 0.8 progress Hmm, JIRA seems to be down... 1084 is in. I'm pretty close to being done on 1103. I'm on #mahout on Freenode if anyone wants to coordinate, and will be there for the next 1 hour or so. On Jun 8, 2013, at 7:21 AM, Grant Ingersoll gsing...@apache.org wrote: We are down to 18 issues! Let's keep cranking. I'm working on 1103 and 1084 at the moment. On Jun 6, 2013, at 12:00 PM, Grant Ingersoll gsing...@apache.org wrote: On Jun 6, 2013, at 12:12 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: Hi Grant, Here's my take: Will/Must be finished: M-944 [include] ^ Committed.
M-958 [include]
M-975 [include]
M-1084 [include]
M-1098 [include]
M-1103 [include]
M-1126 [push if no one steps up]
M-1147 [include]
M-1211 [push if no one steps up]
M-1233 [push if no one steps up]
M-1241 [include]
Can be pushed if no one steps up:
M-627 [push if no one steps up]
M-833 [push if no one steps up]
M-1163 [push if no one steps up]
M-1164 [push if no one steps up]
M-1243 [include]
M-992 [include] ^ Working on this now.
M-996 [push if no one steps up]
M-1067 [include]
Unsure:
M-974 [push if no one steps up]
M-1026 [push if no one steps up]
M-1030 [unsure]
On 06.06.2013 11:26, Grant Ingersoll wrote: Working from the link below, we are down to 22 issues. https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.8%22%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC Here's my opinion (and only my opinion, please vote, change as you see fit) based on a cursory glance of the state of these as to what needs to be in the release and what can be pushed:
Will/Must be finished: M-944 M-958 M-975 M-1084 M-1098 M-1103 M-1126 M-1147 M-1211 M-1233 M-1241
Can be pushed if no one steps up: M-627 M-833 M-1163 M-1164 M-1243 M-992 M-996 M-1067
Unsure: M-974 M-1026 M-1030
Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Commented] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls
[ https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679048#comment-13679048 ] Grant Ingersoll commented on MAHOUT-1211: Patch coming shortly, based on Suneel's original patch. Would appreciate some eyeballs before committing. I went with Sean's approach for readers and writers. I think Dmitriy has a valid point, but perhaps we take it on a case-by-case basis to see if any harm comes from quietly closing readers. Replace deprecated Closables.closeQuietly calls Key: MAHOUT-1211 URL: https://issues.apache.org/jira/browse/MAHOUT-1211 Project: Mahout Issue Type: Improvement Reporter: Stevo Slavic Assignee: Grant Ingersoll Priority: Minor Fix For: 0.8 Attachments: MAHOUT-1211.patch Deprecated Guava {{Closables.closeQuietly}} API has to be replaced; its usage is a code smell, and that method is scheduled to be removed from Guava 16.0. See [this discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] for more info. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
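For readers of the archive, here is a minimal sketch of the kind of split discussed in the comment above: close failures are swallowed for readers (the data has already been consumed) but propagated for writers (a failed close can mean lost output). This is an illustration only and not the committed patch; the class and method names are hypothetical, it uses only java.io, and try-with-resources assumes Java 7 or later. Guava's own replacement is `Closeables.close(Closeable, boolean swallowIOException)`.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

// Hypothetical helper, for illustration only.
final class CloseHandling {

  // Reader-style resources: a failure in close() is swallowed, because
  // everything of interest has already been read by this point.
  static void closeReaderQuietly(Closeable reader) {
    if (reader == null) {
      return;
    }
    try {
      reader.close();
    } catch (IOException e) {
      // Real code would log this instead of ignoring it silently.
    }
  }

  // Writer-style resources: try-with-resources lets an IOException from
  // close() propagate, so a failed flush is not silently lost.
  static String copy(String input) throws IOException {
    StringWriter sink = new StringWriter();
    try (StringReader in = new StringReader(input); StringWriter out = sink) {
      int c;
      while ((c = in.read()) != -1) {
        out.write(c);
      }
    }
    return sink.toString();
  }
}
```

Whether quietly closing readers is acceptable is exactly the case-by-case question raised in the comment; the sketch only shows where the two policies differ.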
[jira] [Updated] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls
[ https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-1211: Attachment: MAHOUT-1211.patch Updated patch to trunk
Re: Random Errors
Tests run: 100, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.75 sec <<< FAILURE!
testViewSequentialAccessSparseVectorWritable {#1 seed=[34643F377C10C8B9:3D6AC6E0C554E86F]}(org.apache.mahout.math.VectorWritableTest) Time elapsed: 0.423 sec <<< ERROR!
com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from TEST scope at testViewSequentialAccessSparseVectorWritable {#1 seed=[34643F377C10C8B9:3D6AC6E0C554E86F]}(org.apache.mahout.math.VectorWritableTest):
   1) Thread[id=13, name=Thread-2, state=RUNNABLE, group=main]
        at com.apple.java.Application.getAppBundleIdNative(Native Method)
        at com.apple.java.Application.getAppBundleId(Application.java:19)
        at com.apple.java.Usage.performReport(Usage.java:52)
        at com.apple.java.Usage.performAfterDelay(Usage.java:27)
        at __randomizedtesting.SeedInfo.seed([34643F377C10C8B9:3D6AC6E0C554E86F]:0)
This may be a hint. Don't get it when running it standalone...
On Jun 9, 2013, at 8:50 AM, Sebastian Schelter ssc.o...@googlemail.com wrote: I observe a similar behavior.
On 09.06.2013 14:47, Grant Ingersoll wrote: I get a failure on the one below when running in parallel, but not standalone:
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 10.358 sec <<< FAILURE!
testRun(org.apache.mahout.text.SequenceFilesFromLuceneStorageMRJobTest) Time elapsed: 10.358 sec <<< FAILURE!
java.lang.AssertionError: expected:<2002> but was:<0>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:743)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at org.junit.Assert.assertEquals(Assert.java:555)
        at org.junit.Assert.assertEquals(Assert.java:542)
        at org.apache.mahout.text.SequenceFilesFromLuceneStorageMRJobTest.testRun(SequenceFilesFromLuceneStorageMRJobTest.java:73)
Interesting thing about this one is the Test class has only a single test and it has no randomization.
FWIW, it's also becoming increasingly clear to me that we need some notion of real integration tests that we can run against a Hadoop cluster (or at least a virtual Hadoop cluster). -Grant On Jun 8, 2013, at 9:38 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote: number generators. Where a test depends on a particular sequence, and somewhere an RNG doesn't use the RandomUtils trick, it may have a different state if other tests ran before. I have a different solution for this in randomizedtesting framework (a Random instance cannot be shared from test to test, it will throw an exception if you do share it). This doesn't solve all the possible problems but proved quite effective at catching test dependencies. The surefire parameter just controls what order the *classes* run in AFAICT: http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#runOrder Yeah, I was on the train when I wrote that e-mail. The trick I remembered is in fact inside JUnit 4.11 and onwards -- https://github.com/junit-team/junit/blob/master/doc/ReleaseNotes4.11.md#test-execution-order D. Grant Ingersoll | @gsingers http://www.lucidworks.com
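The RNG-state dependence described in the quoted message can be sketched in a few lines. This is a hedged illustration with hypothetical names, using only java.util.Random; it is neither Mahout's RandomUtils nor the randomizedtesting framework. It shows why a test that reads from a generator already advanced by earlier tests sees a different position in the sequence than when it runs standalone.

```java
import java.util.Random;

// Hypothetical illustration of order-dependent RNG state.
final class RngOrderDependence {

  // Simulates a test drawing from a generator that earlier tests have
  // already advanced by priorDraws values.
  static int drawAfter(long seed, int priorDraws) {
    Random rng = new Random(seed);
    for (int i = 0; i < priorDraws; i++) {
      rng.nextInt(); // values consumed by "other tests" that ran before
    }
    return rng.nextInt(1000);
  }

  // A test that creates its own freshly seeded generator always starts at
  // the same position, regardless of what ran before it.
  static int drawFresh(long seed) {
    return drawAfter(seed, 0);
  }
}
```

With a fixed seed, `drawFresh` is reproducible, whereas `drawAfter(seed, n)` depends on `n`, which in a real suite is determined by test execution order.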
[jira] [Commented] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls
[ https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679053#comment-13679053 ] Grant Ingersoll commented on MAHOUT-1211: I committed this. We can leave the issue open for others to review and tweak, but it should be possible to close it before the release.
[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop
[ https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679074#comment-13679074 ] Grant Ingersoll commented on MAHOUT-1247: Here's the first error I'm getting: https://paste.apache.org/cik6
{quote}
java.lang.IllegalStateException: /tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/4475940891381251304_1262960862_693852121/localhostdicVec/dictionary.file-0
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
        at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:146)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/4475940891381251304_1262960862_693852121/localhostdicVec/dictionary.file-0
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
        at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:796)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
        ... 9 more
{quote}
Might be related to MAHOUT-992, but not sure.
I added a main to DictionaryVectorizer that allows you to reproduce this off of the prior run of cluster-reuters without having to go re-run everything. cluster-reuters doesn't work on Hadoop -- Key: MAHOUT-1247 URL: https://issues.apache.org/jira/browse/MAHOUT-1247 Project: Mahout Issue Type: Bug Reporter: Grant Ingersoll Assignee: Grant Ingersoll Fix For: 0.8 At least two issues: 1. MAHOUT-992 messed up the Distributed Cache stuff somehow 2. The ExtractReuters data is not being moved to HDFS.
[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop
[ https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679076#comment-13679076 ] Grant Ingersoll commented on MAHOUT-1247: After you run cluster-reuters.sh, you can run: {code}bin/mahout org.apache.mahout.vectorizer.DictionaryVectorizer -i /tmp/mahout-work-grantingersoll/reuters-out-seqdir-sparse-kmeans/tokenized-documents -o ./dicVec{code} Make sure you have HADOOP_HOME set, and substitute the appropriate work directory for your environment.
[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop
[ https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679090#comment-13679090 ] Grant Ingersoll commented on MAHOUT-1247: I think I see the issue. The cache file is on the local file system, but the Iterator has a Hadoop conf that expects an HDFS file, so it can't find it.
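The mismatch described in this comment can be illustrated without Hadoop. The sketch below is an analogy only: it uses java.net.URI as a stand-in for Hadoop's path handling, with hypothetical names and a hypothetical shortened path. It shows how a scheme-less local path, when qualified against a default file system of hdfs://localhost:9000, turns into the kind of hdfs:// URI seen in the FileNotFoundException from the earlier stack trace.

```java
import java.net.URI;

// Analogy only: java.net.URI standing in for Hadoop's path qualification.
final class PathSchemeDemo {

  // A path that already carries a scheme (file://, hdfs://) is kept as-is;
  // a scheme-less path is resolved against the default file system URI.
  static URI qualify(URI defaultFs, String path) {
    URI p = URI.create(path);
    return p.getScheme() != null ? p : defaultFs.resolve(p);
  }

  public static void main(String[] args) {
    URI defaultFs = URI.create("hdfs://localhost:9000/");
    // Hypothetical local cache path with no scheme: ends up looking like HDFS.
    System.out.println(qualify(defaultFs, "/tmp/distcache/dictionary.file-0"));
    // The same path with an explicit file scheme stays local.
    System.out.println(qualify(defaultFs, "file:///tmp/distcache/dictionary.file-0"));
  }
}
```

In Hadoop itself, one way to read a local file regardless of the configured default file system is `FileSystem.getLocal(Configuration)`; this thread does not say whether the eventual fix took that route.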
[jira] [Commented] (MAHOUT-975) Bug in Gradient Machine - Computation of the gradient
[ https://issues.apache.org/jira/browse/MAHOUT-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679143#comment-13679143 ] Grant Ingersoll commented on MAHOUT-975: [~tdunning] Any chance this is getting in this week?
Bug in Gradient Machine - Computation of the gradient
Key: MAHOUT-975 URL: https://issues.apache.org/jira/browse/MAHOUT-975 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.7 Reporter: Christian Herta Assignee: Ted Dunning Fix For: 0.8 Attachments: GradientMachine.patch
The initialisation used to compute the gradient descent weight updates for the output units appears to be wrong. The comment in the code says: "dy / dw is just w since y = x' * w + b." This is wrong: dy/dw is x (ignoring the indices). The code performs the same initialisation as the comment describes.
Check by using neural network terminology: the gradient machine is a specialized version of a multi-layer perceptron (MLP). In an MLP, the gradient for computing the weight change for the output units is:
dE/dw_ij = dE/dz_i * dz_i/dw_ij, with z_i = sum_j (w_ij * a_j)
Here i is the index of the output layer and j the index of the hidden layer (d stands for partial derivatives), and z_i = a_i (no squashing in the output layer). The special loss (cost function) is:
E = 1 - a_g + a_b = 1 - z_g + z_b
where g is the index of the output unit with target value +1 (positive class) and b is a random output unit with target value 0. It follows that:
dE/dw_gj = dE/dz_g * dz_g/dw_gj = -1 * a_j
dE/dw_bj = dE/dz_b * dz_b/dw_bj = +1 * a_j
(a_j: activity of hidden unit j). This is what a corrected comment would give: dy/dw = x (x here being the activation of the hidden unit), multiplied by -1 for the weights to the output unit with target value +1.
In neural network implementations it is common to compute the gradient numerically as a test of the implementation. This can be done by:
dE/dw_ij = (E(w_ij + epsilon) - E(w_ij - epsilon)) / (2 * epsilon)
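The central-difference check above can be sketched as follows. This is a minimal illustration under stated assumptions, not Mahout's GradientMachine code: the class and method names are hypothetical, the output layer is linear (z_i = sum_j w_ij * a_j), and the loss is E = 1 - z_g + z_b as in the issue description. The analytic gradient is -a_j for the weights of unit g, +a_j for the weights of unit b, and 0 elsewhere.

```java
// Hypothetical gradient check, for illustration only.
final class GradientCheck {

  // Linear output unit: z_i = sum_j w[i][j] * a[j] (no squashing).
  static double output(double[][] w, double[] a, int i) {
    double z = 0.0;
    for (int j = 0; j < a.length; j++) {
      z += w[i][j] * a[j];
    }
    return z;
  }

  // Loss from the issue description: E = 1 - z_g + z_b.
  static double loss(double[][] w, double[] a, int g, int b) {
    return 1.0 - output(w, a, g) + output(w, a, b);
  }

  // Analytic gradient: dE/dw_gj = -a_j, dE/dw_bj = +a_j, zero otherwise.
  static double analyticGradient(double[] a, int g, int b, int i, int j) {
    double grad = 0.0;
    if (i == b) {
      grad += a[j];
    }
    if (i == g) {
      grad -= a[j];
    }
    return grad;
  }

  // Central difference: (E(w_ij + eps) - E(w_ij - eps)) / (2 * eps).
  static double numericGradient(double[][] w, double[] a, int g, int b,
                                int i, int j, double eps) {
    double original = w[i][j];
    w[i][j] = original + eps;
    double plus = loss(w, a, g, b);
    w[i][j] = original - eps;
    double minus = loss(w, a, g, b);
    w[i][j] = original; // restore the weight before returning
    return (plus - minus) / (2.0 * eps);
  }
}
```

Comparing `numericGradient` with `analyticGradient` for every (i, j) pair is the check; a mismatch beyond floating-point tolerance would point to an error like the one reported in this issue.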
[jira] [Resolved] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop
[ https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-1247. Resolution: Fixed. Fixed by MAHOUT-992.