Re: VOTE: moving commits to git-wp.o.a github PR features.

2014-05-19 Thread Grant Ingersoll
+1

On May 16, 2014, at 2:02 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Hi,
 
 I would like to initiate a procedural vote on moving to git as our primary
 commit system, and using github PRs as described in Jake Farrell's email to
 @dev [1]
 
 [1]
 https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
 
 If voting succeeds, I will file a ticket with infra to commence necessary
 changes and to move our project to git-wp as primary source for commits as
 well as add github integration features [1]. (I assume pure git commits
 will be required after that's done, with no svn commits allowed).
 
 The motivation is to engage git and github PR features as described, and
 avoid git mirror history messes like we've seen associated with authors.txt
 file fluctuations.
 
 PMC and committers have binding votes, so please vote. Lazy consensus with
 minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time
 for the weekend (i.e. Tuesday afternoon PST).
 
 here is my +1
 
 -d


Grant Ingersoll | @gsingers
http://www.lucidworks.com
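
The voting rule in the quoted message above (only PMC and committer votes are binding, at least three +1s needed) can be sketched as a small tally. This is an illustration only, not the ASF's tooling: the `Vote` type and `vote_passes` function are hypothetical names, and treating a binding -1 as blocking is an assumption that the message does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int     # +1, 0 or -1
    binding: bool  # True for PMC members and committers


def vote_passes(votes, min_plus_ones=3):
    """Return True if the binding votes meet the threshold.

    Assumption: any binding -1 blocks the vote; the original email
    only states the minimum number of +1 votes.
    """
    binding = [v for v in votes if v.binding]
    plus = sum(1 for v in binding if v.value == 1)
    minus = sum(1 for v in binding if v.value == -1)
    return plus >= min_plus_ones and minus == 0
```

Non-binding votes are ignored by the tally, so a +1 from a user on the list shows support but does not count toward the threshold.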







Re: [MAHOUT-EXAMPLES Jenkins] Mahout-Examples-Cluster-Reuters-II - Build # 831 - Still Failing

2014-05-04 Thread Grant Ingersoll
I gave a few more people access: Frank, you and Andrew.  Happy to add others.

-Grant

On May 2, 2014, at 2:29 PM, Sebastian Schelter s...@apache.org wrote:

 Do we have access now to fix the build? This is becoming really annoying; we only
 have to change a few lines in the jenkins config...
 
 On 05/02/2014 08:24 PM, Apache Jenkins Server wrote:
 The Apache Jenkins build system has built Mahout-Examples-Cluster-Reuters-II 
 (build #831)
 
 Status: Still Failing
 
 Check console output at 
 https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/831/ to 
 view the results.
 
 




Re: Board Report

2014-04-08 Thread Grant Ingersoll

On Apr 7, 2014, at 10:29 AM, Pat Ferrel p...@occamsmachete.com wrote:

 The document does not mention the state of the existing Spark work in the 
 snapshot codebase. Shouldn’t this be noted?

It's under the community section.


 
 On Apr 7, 2014, at 5:06 AM, Sebastian Schelter s...@apache.org wrote:
 
 I think we should mention the redesign/rework of the website and the 
 completion of the move from the old wiki to Apache CMS.
 
 --sebastian
 

Re: Board Report

2014-04-08 Thread Grant Ingersoll

On Apr 7, 2014, at 11:03 AM, Pat Ferrel p...@occamsmachete.com wrote:

 Mahout needs a reboot. Grant has the right perspective, but I’d take it 
 further. His #2 (two efforts) is not and never would be reasonable in 
 anything but a huge company. 
 

FWIW, that was my view _if_ I were in a company funding it.  Further down, my
take is that for the most part we should follow the natural Apache way and let
those who do the work make the choices, which, AFAICT, point toward forgetting
about #1 and pursuing #2 only.

-Grant


 I have never taken, and would never take, a team the size of Mahout (even with
 some new committers) and split a reboot into two parts on two engines. No sane
 project manager would allow this. Why do we think it will work here?
 
 The recent Gigaom article left me sympathetic with how confused the readers 
 must be, let alone potential users or contributors.
 
 Sean is not being nihilistic; two directions will not work for Mahout. Mahout
 has a bad reputation already for being a poorly documented and poorly
 integrated loose collection of code with a lot of technical debt. Honestly,
 has anyone reading this seen increasing interest in the project? A reboot is
 the only thing I can imagine to re-energize it, and even that must be done
 with the utmost in clear communication.
 
 If you accept the above then there seem to be some ways forward:
 1) reboot on Spark, let 0xdata do what they will.
 2) reboot on 0xdata and let the Spark committers consider becoming MLlib
 committers or other.
 3) fail by issuing confusing direction statements, spending too much time
 supporting and reconciling multiple significantly disparate efforts and
 dividing committers. This is such a classic fail that I have a hard time even
 considering it.
 
 I’d like to see #1 for what it’s worth. A concerted effort by all on #1 would 
 ensure Mahout is included in future distros. Maybe even #2 would be included 
 but #3? It’s a non-starter.
 

Re: Board Report

2014-04-07 Thread Grant Ingersoll
To Sean's point, if Mahout were my company, I would do the following pragmatic,
albeit not so pleasant, thing, assuming, of course, I had the $$$ to do
so:

1. Clean up existing code with a laser focus on a few key areas (Sebastian's 
list makes sense) using a part of the team and call it 1.0 and ship it, as it 
has a number of users and they deserve to not have the rug pulled out from 
under them.  

2. Spin out a subset of the team to explore and prototype 2.0 based on two very 
positive and re-energizing looking ideas:
a. Scala DSL (and maybe Spark)
b. 0xData

All of the work for #2 would be done in a clean repo and would only 
bring in legacy code where it was truly beneficial (back compat. can come 
later, if at all).
It would then benchmark those two approaches as well as look at where 
they overlap and are mutually beneficial and then go forward with the winner.

3. Once #2 is viable, put most effort into it and maintain 1.0 with as little
support as possible, encouraging, nay -- actively helping -- 1.0 customers
upgrade as quickly as possible.

The tricky part then becomes how do you make sure to still make your sales #'s 
while also convincing them that your roadmap is what they are really buying.

If I didn't have the $$$ to do both of these (i.e. we need a massive turn 
around and we have one last shot), I would be all in on #2.

---

That being said, Mahout is not my company.  Heck, Mahout is not even a 
company, so we don't need to be bound by company conventions and thought 
processes, even if that fits with all of our individual day jobs.  And, 
thankfully, we don't have any sales numbers to make.

We are chartered with one and only one mission: produce open source, scalable 
machine learning libraries under the Apache license and community driven 
principles.  We are not required by the Board or anyone else to support version 
X for Y years or to use Hadoop or Scala or Java.  We are also not required to 
implement any specific algorithms or deliver them on specific time frames.  We 
are also not required to provide users upgrade paths or the like.  Naturally, 
we _want_ to do these things for the sake of the community, but let's be clear: 
it is not a requirement from the ASF.  We are, however, required to have a
sustaining community.



I personally think we should start clean on #2, throwing off the shackles of 
the past and emerge 6-9 months later with Mahout 2.0 (and yes, call it that, 
not 0.1 as Sebastian suggests, for marketing reasons) built on a completely new 
and fresh repository, likely bringing in only the Math/collections 
underpinnings and maybe the build system.  This new repository would have only 
a handful of core algorithms that we know are well implemented, sustainable and 
best in class.  

I think we should look at the lead up to 0.9 as an experiment that proved out a 
lot of interesting ideas, including the fact that Mahout proved there is vast 
interest in open source large scale machine learning and that it is the 
benchmark for comparison.  Not many other ML projects can say that, even if 
they have better technical implementations or are less fragmented.  Once you
realize something has outlived its usefulness in software, however, there is
no point in lingering.

That being said, at least for the foreseeable future, I am not in a position to 
contribute much code.  So, from my perspective, the ASF Meritocratic approach 
takes over:  those who do the work make the decisions.  If you want something 
in, then put up the patch and ask for feedback.  If no one provides feedback, 
assume lazy consensus and move forward.  Nothing convinces people better than 
actual, real, executing code.  For my part, I am happy to continue to work the 
bureaucratic side of things to make sure reports get filed, credentials get 
created, etc. and the occasional patch.  I hope one day I will have time to 
contribute again.

I will follow up w/ a separate email on what I am going to put in the Board 
Report.

On Apr 7, 2014, at 1:52 AM, Sean Owen sro...@gmail.com wrote:

 No, it's about the opposite. I'm referring to the default, current
 state of play here.
 
 The issues for a vendor are demand and supportability. Do people want
 to pay for support of X? Can you honestly say you have expertise to
 support and influence X over at least a major release cycle (12-18
 months)? The latter needs a reasonably reliable roadmap and
 continuity.
 
 I'm suggesting that in the current state, demand is low and going
 down. The current code base seems de facto deprecated/unsupported
 already, and possibly to be removed or dramatically changed into
 something as-yet unclear. Nobody here seems to have taken a hard
 decision regarding a next major release, but, the trajectory of that
 decision seems clear if the current state remains the same.
 
 From my perspective, 

Re: Board Report

2014-04-07 Thread Grant Ingersoll
Here is my proposed report.  For the most part, I think the only right thing to 
do vis-a-vis the Board is to report that we are in the midst of a healthy (yes, 
I believe it is, for the most part healthy and normal) discussion on where to 
go next.

PMC Members: this is checked into SVN at 
https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt.
  It is due on Wednesday.  If you object to this approach of reporting, please 
let me know ASAP and suggest alternatives.

=== Apache Mahout Status Report: April 2014 ===

-

Apache Mahout has implementations of a wide range of machine learning and
data mining algorithms: clustering, classification, collaborative filtering
and frequent pattern mining

Project Status
--

The project continues to have a large and active user base.  While
the developer base has continued to grow, there is a very active
and healthy debate going on about where Mahout goes next.  Please
see the Issues section below for more details.

Community
-

* Andrew Musselman was voted in as new committer.
* No changes to the PMC in the reporting period.

* The main issue concerning the community right now is the addition
of new contributions from 0xData and the integration of Mahout with Spark.

Community Objectives


Our goal is to build scalable machine learning libraries. See the Issues
section below for the debate in the community about our objectives.


Releases


In addition to an ongoing debate on Mahout's future, the community is actively
working on integrating Mahout with Scala/Spark, updating documentation, and
bringing in new code and committers to update the core project.


Issues
--
The Mahout community is at a crossroads in terms of where
to go next.  While the project has a large number of users and interested
parties, most committers are trying to maintain the code base on a purely
part-time basis, when the amount of work needed to sustain these users
clearly calls for full-time effort.  Furthermore, much of our original
code base is written
for Hadoop MapReduce 1.0, which many in the community have come to realize
is not well-suited for solving the kinds of problems that Mahout has set
out to solve.  There have been several lengthy discussions and prototypes
aimed at working out next directions along the lines of the Spark and
0xData contributions (there are numerous threads on the dev@mahout.a.o
mailing list).

The PMC does not think this requires Board intervention at this time
as the debate is, as far as we can tell, healthy.  We do, however,
expect that this debate will take some time to resolve and may mean we
won't be shipping a 1.0 release any time soon.  We will keep the Board
apprised of our next steps as we work through the process.





Re: Board Report

2014-04-07 Thread Grant Ingersoll
Good point, please update the report (you should have credentials)

-Grant

On Apr 7, 2014, at 5:06 AM, Sebastian Schelter s...@apache.org wrote:

 I think we should mention the redesign/rework of the website and the 
 completion of the move from the old wiki to Apache CMS.
 
 --sebastian
 

Re: Mail and IRC parsing

2014-04-04 Thread Grant Ingersoll
We've (LucidWorks) got full indexing and search of the Mahout mail archives at 
http://find.searchub.org.  We could probably add in IRC pretty easily if you 
want.

-Grant

On Mar 22, 2014, at 2:06 AM, Andrew Musselman andrew.mussel...@gmail.com 
wrote:

 I put up a parser for the IRC history logs here
 https://github.com/andrewmusselman/util/blob/master/irc-parser.sh
 
 I'd like to write one for the user list too to figure out the most common
 problems/questions so we can focus effort on repairs to bugs and docs.
 
 But the mail archives at
 https://mail-archives.apache.org/mod_mbox/mahout-user/ are dynamic, loaded
 in through JavaScript, so parsing them isn't that straightforward.
 
 Is it possible to get the mbox files directly?
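
If the raw mbox files can be obtained, the "most common problems/questions" idea in the quoted message could be approximated by counting thread subjects. The following is a minimal sketch under that assumption, using Python's standard `mailbox` module; the function names and the choice to group messages by normalized subject line are illustrative and not taken from the thread.

```python
import mailbox
import re
from collections import Counter

# Matches one or more leading "Re:" / "Fwd:" / "Fw:" prefixes.
_REPLY_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes, collapse whitespace, lowercase."""
    subject = _REPLY_PREFIX.sub("", str(subject or ""))
    return " ".join(subject.split()).lower()


def most_common_subjects(mbox_path, top_n=10):
    """Return the top_n (subject, message_count) pairs in an mbox file."""
    counts = Counter()
    for message in mailbox.mbox(mbox_path):
        subject = normalize_subject(message["Subject"])
        if subject:
            counts[subject] += 1
    return counts.most_common(top_n)
```

Counting by subject is a rough proxy: it surfaces long threads rather than distinct problems, so the output would still need a manual pass to group related questions.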


Board Report

2014-04-04 Thread Grant Ingersoll
Can someone summarize the 0xData and the Spark work for me for the board 
report?  I've unfortunately been too busy to keep up on the threads on it, but 
need to write the board report for this month.

You can either summarize here or add it to the community section at 
https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt

Also, assuming we are going ahead w/ the 0xData stuff, we likely need to do a 
software grant for that.

Thanks,
Grant


Re: 0xdata interested in contributing

2014-03-13 Thread Grant Ingersoll
 in the framework to be created, managed and deleted.
 
 There is also an R binding for h2o which allows programs to access and
 manage h2o objects.  Functions defined in an R-like language can be applied
 in parallel to
 data frames stored in the h2o framework.
 
 *Proposed Developer User Experience*
 
 I see several kinds of users.  These include numerical developers (largely
 mathematicians), Java or Scala developers (like current Mahout devs), and
 data
 analysts.
 
 - Local h2o single-node cluster
 - Temporary h2o cluster
 - Shared h2o cluster
 
 All of these modes will be facilitated by the proposed development.
 
 *Complementarity with Other Platforms*
 
 I view h2o as complementary with Hadoop and Spark because it provides a
 solid in-memory execution engine as opposed to a general out-of-core
 computation model that other map-reduce engines like Hadoop and Spark
 implement or more general dataflow systems like Stratosphere, Tez or Drill.
 
 Also, h2o provides no persistence but depends on other systems for that
 such as NFS, HDFS, NAS or MapR.
 
 H2o is also nicely complementary to R in that R can invoke operations and
 move data to and from h2o very easily.
 
 *Required Additional Work*
 
 Sparse matrices
 Linear algebra bindings
 Class-file magic to allow off-the-cuff function definitions


Re: MAHOUT 0.9 Release - New URL

2014-01-23 Thread Grant Ingersoll
+1 from me.

On Jan 22, 2014, at 5:55 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Fixed the issues that were reported this week and restored FP mining into the 
 codebase.
 
 Here's the URL for the final release in staging:-
 https://repository.apache.org/content/repositories/orgapachemahout-1003/org/apache/mahout/mahout-distribution/0.9/
 
 The artifacts have been signed with the following key:
 https://people.apache.org/keys/committer/smarthi.asc
 
 
 a) Verify that u can unpack the release (tar or zip)
 b) Verify u r able to compile the distro
 c)  Run through the unit tests: mvn clean test
 d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run 
 through all the different options in each script.
 
 Committers and PMC, need a minimum of 3 '+1' votes for the release to be 
 finalized.


Re: Mahout 0.9 Release - Call for Volunteers

2014-01-18 Thread Grant Ingersoll
 $MAHOUT_HOME/examples/bin. 
 Please run through all the different options in each script.
 
 
 
 Committers and PMC members:
 ---
 
 Need at least 3 +1 votes from this group for the Release to pass.
 
 
 Thanks and Regards.
 
 
 
 
 --
 Thanks,
 Chameera
 
 
 
 
 --
 --
 Yexi Jiang,
 ECS 251,  yjian...@cs.fiu.edu
 School of Computer and Information Science, Florida International
 University
 Homepage: http://users.cis.fiu.edu/~yjian004/
 
 
 
 
 
 
 


Re: MAHOUT 0.9 Release - New URL

2014-01-18 Thread Grant Ingersoll
Ran the tests, verified sigs, tried out a few of the examples.

+1 (binding)

On Jan 16, 2014, at 9:41 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Third time's a Charm!!!
 
 
 Here's the new URL for Mahout 0.9 Release:
 https://repository.apache.org/content/repositories/orgapachemahout-1002/org/apache/mahout/mahout-distribution/0.9/
 
 For those volunteering to test this, some of the things to be verified:
 
 a) Verify that u can unpack the release (tar or zip)
 b) Verify u r able to compile the distro
 c)  Run through the unit tests: mvn clean test
 d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run 
 through all the different options in each script.
  
 
 Committers
 and PMC members:
 ---
 
 Need 'at least 3 +1 votes' for the Release to pass. 
 
 
 Thanks and Regards.




[jira] [Commented] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

2013-10-30 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809813#comment-13809813
 ] 

Grant Ingersoll commented on MAHOUT-1030:
-

Andrew, I suppose it depends on what part of it you want to address.  If it is 
the literal part of this bug, Pat has been pretty responsive.  If it is the 
reworking of the properties of vectors, that is probably best handled on the 
mailing list.  The basic gist being we want to more intelligently handle vector 
properties and get rid of NamedVector.  [~tdunning], [~robinanil] and others 
may have some thoughts here as well.

(FWIW, I'd prefer the latter to be tackled.)

 Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
 WeightedVectorWritable
 

 Key: MAHOUT-1030
 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
 Project: Mahout
  Issue Type: Bug
  Components: Clustering, Integration
Affects Versions: 0.7
Reporter: Jeff Eastman
Assignee: Andrew Musselman
 Fix For: 1.0, 0.9

 Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch


 Looks like this won't make it into this build. Pretty widespread impact on 
 code and tests and I don't know which properties were implemented in the old 
 version. I will create a JIRA and post my interim results.
 On 6/8/12 12:21 PM, Jeff Eastman wrote:
  That's a reversion that evidently got in when the new 
  ClusterClassificationDriver was introduced. It should be a pretty easy fix 
  and I will see if I can make the change before Paritosh cuts the release 
  bits tonight.
 
  On 6/7/12 1:00 PM, Pat Ferrel wrote:
  It appears that in kmeans the clusteredPoints are now written as 
  WeightedVectorWritable where in mahout 0.6 they were 
  WeightedPropertyVectorWritable? This means that the distance from the 
  centroid is no longer stored here? Why? I hope I'm wrong because that is 
  not a welcome change. How is one to order clustered docs by distance from 
  cluster centroid?
 
  I'm sure I could calculate the distance but that would mean looking up the 
  centroid for the cluster id given in the above WeightedVectorWritable, 
  which means iterating through all the clusters for each clustered doc. In 
  my case the number of clusters could be fairly large.
 
  Am I missing something?
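
 The per-document scan over all clusters described above can be avoided by 
 indexing the centroids by cluster id once. The following is only a rough 
 sketch in Python rather than Mahout code: the function names, the tuple 
 layout and the use of Euclidean distance are all assumptions made for 
 illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def order_by_distance(centroids, clustered_points):
    """Group documents by cluster and sort each group by distance to its centroid.

    centroids: dict mapping cluster id -> centroid vector (hypothetical layout)
    clustered_points: iterable of (cluster_id, doc_id, vector) tuples
    """
    ranked = {}
    for cluster_id, doc_id, vector in clustered_points:
        # Dictionary lookup instead of iterating over every cluster.
        distance = euclidean(vector, centroids[cluster_id])
        ranked.setdefault(cluster_id, []).append((distance, doc_id))
    for docs in ranked.values():
        docs.sort()
    return ranked


centroids = {0: [0.0, 0.0], 1: [10.0, 10.0]}
points = [(0, "a", [3.0, 4.0]), (0, "b", [1.0, 0.0]), (1, "c", [10.0, 12.0])]
ranked = order_by_distance(centroids, points)
print(ranked)
```

 This still recomputes every distance, so it only removes the lookup cost; 
 storing the distance alongside the point, as the 0.6 output did, avoids the 
 recomputation as well.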
 
 
 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


Re: Mahout's future

2013-10-24 Thread Grant Ingersoll

On Oct 17, 2013, at 7:46 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 
 8.  Mahout-627: Baum Welch Algorithm on MapReduce for Parallel HMM Training
 
   Grant, do we need to push this to Backlog?

Yes.  Sorry for the delay, in a new role at work that is consuming most of my 
cycles at this point in time.




Re: Mahout's future

2013-10-15 Thread Grant Ingersoll

On Oct 15, 2013, at 1:21 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Will schedule a hangout for this Thursday - 7pm (Eastern Time) tentatively.
 

Sorry, just catching up.  I can't make this week, but can next.  Feel free to 
go ahead w/o me at this point, given the momentum.

 I would like us to first discuss the Mahout 0.9 release; I will send out an 
 agenda once I schedule it.
 
 Regards,
 Suneel
 
 
 
 
 On Tuesday, October 15, 2013 12:24 AM, Saikat Kanjilal sxk1...@hotmail.com 
 wrote:
 
 Following up: Suneel/Grant, are we still on for meeting this week on a Google 
 hangout? I would love to meet this week.
 
 From: sxk1...@hotmail.com
 To: dev@mahout.apache.org
 Subject: RE: Mahout's future
 Date: Sun, 6 Oct 2013 07:00:50 -0700
 
 +1. Can you send out a quick agenda (hopefully with my input incorporated) 
 before the hangout? Regards
 Date: Sun, 6 Oct 2013 03:58:10 -0700
 From: suneel_mar...@yahoo.com
 Subject: Re: Mahout's future
 To: dev@mahout.apache.org
 
 Grant would be available the week of Oct 14 for a hangout (tentatively).
 We could go ahead and schedule one next week if there's (and seems very 
 much like it) enough response.  I can go ahead and facilitate one.
 
 I will be 100% focused on Mahout from next week once I start at my new job 
 from Monday. 
 
 Regarding building something for Deep Learning, Yexi's patch for MLP (see 
 M-1265) may be a good place to refactor/start thinking about the 
 foundations.
 I guess Ted is alluding to building something like what's been described in 
 the Google paper (see 
 http://www.cs.toronto.edu/~ranzato/publications/DistBeliefNIPS2012_withAppendix.pdf).
  Correct?
 
 
 Suneel
   
 
 
 
 
   From: Ted Dunning ted.dunn...@gmail.com
 To: dev@mahout.apache.org dev@mahout.apache.org 
 Cc: dev@mahout.apache.org dev@mahout.apache.org 
 Sent: Sunday, October 6, 2013 2:10 AM
 Subject: Re: Mahout's future
   
 
 Saikat
 
 These are all good suggestions.  I would have a hard time suggesting a 
 prioritization of them.  
 
 Does anybody remember what grant said about having another hangout?  
 
 Sent from my iPhone
 
 On Oct 6, 2013, at 7:15, Saikat Kanjilal sxk1...@hotmail.com wrote:
 
 I wanted to mention a few other things:
 1) It might be useful to take and embed a few already productionalized use 
 cases into the integration tests in mahout, this will help additional users 
 get on board faster
 2) Deep learning is really interesting, however I'd like to help research 
 some common use cases first before tying this into mahout
 3) It'd be good to put some thought into documenting when you would choose 
 what type of algorithm given a production machine learning recommendation 
 system to build, this would give more visibility for users into choosing 
 the right mixture of algorithms to build a production ready recommender, 
 often what I've found is that a bulk of the time in building 
 productionalized recommenders is spent cleaning and filtering noisy data
 4) I'd like to also explore how to tie in machine learning algorithms into 
 real time systems built using twitter storm (http://storm-project.net/), it 
 seems that industry more and more is wanting to do real time analytics on 
 the fly, I'm curious what type of algorithms we'd need for this and back 
 propagate these into mahout
 
 It'd be good to meet like minded devs  together locally (Seattle) or over 
 gtalk/conference to talk through possibilities.
 Regards
 From: ted.dunn...@gmail.com
 Date: Sat, 5 Oct 2013 18:13:40 -0700
 Subject: Re: Mahout's future
 To: dev@mahout.apache.org
 
 On Sat, Oct 5, 2013 at 5:08 PM, Saikat Kanjilal sxk1...@hotmail.com 
 wrote:
 
 Does it make sense to have a quick meeting of interested developers over
 google chat/conference rather than email to discuss and assign folks to
 specifics?
 
 Thoughts?
 
 Great idea.
 
 I think that Grant may have been organizing a hangout.
 



Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: Reconsider moving to Apache CMS only? [Was: Confluence Wiki SPAM and new restrictions in place.]

2013-10-03 Thread Grant Ingersoll

On Oct 2, 2013, at 7:36 AM, Isabel Drost-Fromm isa...@apache.org wrote:

 
 Hi,
 
 this topic popped up a couple of times in the past - given the current spam 
 incident in the Apache confluence wikis, a few more restrictions were put 
 into 
 place for editing pages in the wiki:
 
 removed any editing access for the confluence-users group. From now on, if
 someone wants to edit your wiki, you have to whitelist them specifically. You
 can do this if they are a committer by listing them in the 'Individual
 Users' section of the Space Permissions area, or by asking that they be
 added to the special 'asf-cla' wiki group - we will check that they have a
 iCLA on file before adding them.
 
 Given the need to whitelist everyone who wants to do changes to the wiki 
 pages 
 I wonder whether it makes sense to move most of our docs over to Apache CMS 
 (except maybe for the most volatile pages, if there are any). 

I think it does make sense.

 
 The obvious disadvantage would be a higher barrier of entry for people 
 providing docs (though prior to being whitelisted one would have to express 
 the intent to provide improvements on the mailing list anyway). The advantage 
 could be a clearer path towards committership for those not working on code 
 but on technical writing.

I like what we do over in Solr land, an official reference guide that is 
maintained by committers + patches and then a wiki which allows free editing 
(for the most part).  Things generally move from the wiki to the Ref guide.

 
 The only question concerning the move to Apache CMS I have: How easy is it to 
 provide documentation for individual released versions? Would it be possible 
 to e.g. bundle the then current docs with the release?

It's all in SVN and is usually markdown.  Tag it and ship it!



Re: 0.9?

2013-09-30 Thread Grant Ingersoll
Hi Ted,

This sounds good to the extent we can get them done.  Do you have JIRA issues 
for any of these open?  November isn't hard and fast for 0.9, but I suspect it 
will be January if we push things out.

-Grant

On Sep 28, 2013, at 1:59 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 The one large-ish feature that I think would find general use would be a high 
 performance classifier trainer.  
 
 For cleanup sort of things, it would be good to fully integrate the streaming 
 k-means into the normal clustering commands while revamping the command line 
 API.  
 
 Dmitriy's recent scala work would help quite a bit before 1.0. Not sure it 
 can make 0.9. 
 
 For recommendations, I think that the demo system that Pat started with the 
 elaborations by Ellen and Tim would be very good to have. 
 
 I would be happy to collaborate with somebody on these but am not at all 
 likely to have time to actually do them end to end. 
 
 Sent from my iPhone
 
 On Sep 28, 2013, at 12:40, Grant Ingersoll gsing...@apache.org wrote:
 
 Moving closer to 1.0, removing cruft, etc.  Do we have any more major 
 features planned for 1.0?  I think we said during 0.8 that we would try to 
 follow pretty quickly w/ another release.
 
 -Grant
 
 On Sep 28, 2013, at 12:33 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Sounds right in principle but perhaps a bit soon.  
 
 What would define the release?
 
 Sent from my iPhone
 
 On Sep 27, 2013, at 7:48, Grant Ingersoll gsing...@apache.org wrote:
 
 Anyone interested in thinking about 0.9 in the early Nov. time frame?
 
 -Grant
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Fwd: ASF Board Report - Initial Reminder for Oct 2013

2013-09-29 Thread Grant Ingersoll
FYI.  I'll circulate a draft this week.

Begin forwarded message:

 From: ASF Board bo...@apache.org
 Subject: ASF Board Report - Initial Reminder for Oct 2013
 Date: September 29, 2013 3:29:09 PM EDT
 To: Grant Ingersoll gsing...@apache.org
 
 
 
 This email was sent by an automated system on behalf of the ASF Board.
 It is an initial reminder to give you plenty of time to prepare the report.
 
 The meeting is scheduled for Wed, 16 October 2013, 10:30:00:00 PST and the 
 deadline for
 submitting your report is 1 full week prior to that (Wed, Oct 9th)!
 
 According to board records, you are listed as the chair of at least one
 committee that is due to submit a report this month. [1] [2]
 
 Details on which project reports are due and how to submit a report 
 are enclosed below.
 
 Please submit your report with sufficient time to allow the board members
 to review and digest. Again, the very latest you should submit your report
 is 1 full week (7days) prior to the board meeting (Wed, Oct 9th).
 
 If you feel that an error has been made, please consult [1] and if there
 is still an issue then contact the board directly.
 
 As always, PMC chairs are welcome to attend the board meeting.
 
 Thanks,
 The ASF Board
 
 [1] - https://svn.apache.org/repos/private/committers/board/committee-info.txt
 [2] - https://svn.apache.org/repos/private/committers/board/calendar.txt
 [3] - https://svn.apache.org/repos/private/committers/board/templates
 
 
 Submitting your Report
 --
 
 Full details about the process and schedule are in [1].
 
 The report should be committed to the meeting agenda in the board directory
 in the foundation repository, trying to keep a similar format to the others.
 This can be found at:
 
  https://svn.apache.org/repos/private/foundation/board
 
 Your report should also be sent in plain-text format to bo...@apache.org
 with a Subject line that follows the below format:
 
Subject: [REPORT] Project Name
 
 Cutting and pasting directly from a Wiki is not acceptable due to formatting
 issues. Line lengths should be limited to 77 characters.
 
 
 Resolutions
 ---
 
 There are several templates for use for various Board resolutions.
 They can be found in [3] and you are encouraged to use them. It is
 strongly recommended that if you have a resolution before the board,
 you are encouraged to attend that board meeting.
 
 
 ASF Board Reports
 -
 
 Reports are due from you for the following committees:
 
- Mahout


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: 0.9?

2013-09-28 Thread Grant Ingersoll

On Sep 27, 2013, at 9:07 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 I was gonna bring this up myself next week (and was chatting with Isabel 
 about it today morning).
 
 I was thinking of the following for 0.9:-
 
 1. We have already removed the algorithms that have been marked as deprecated 
 in 0.8
 2.  Bugs that have been fixed since 0.8.
 3.  New Features in 0.9 could include :-
 a) New Multilayer Perceptron that Yexi had contributed recently and is 
 presently pending review (don't know the JIRA# off the top of my head).  
 b)  Using Finite State Transducers as a dictionary type. I had opened a 
 Jira for this and can work on it.
  

Are you using Lucene's FSTs for this?

Rest sounds good.


 Anything else others would like to add???
 
 Grant, could we have a hangout the week of Oct 7 :) ??

I can't that week, but probably the following.

 
 
 
 
 
 From: Grant Ingersoll gsing...@apache.org
 To: dev@mahout.apache.org dev@mahout.apache.org 
 Sent: Friday, September 27, 2013 8:48 AM
 Subject: 0.9?
 
 
 Anyone interested in thinking about 0.9 in the early Nov. time frame?
 
 -Grant


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: 0.9?

2013-09-28 Thread Grant Ingersoll
Moving closer to 1.0, removing cruft, etc.  Do we have any more major features 
planned for 1.0?  I think we said during 0.8 that we would try to follow pretty 
quickly w/ another release.

-Grant

On Sep 28, 2013, at 12:33 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Sounds right in principle but perhaps a bit soon.  
 
 What would define the release?
 
 Sent from my iPhone
 
 On Sep 27, 2013, at 7:48, Grant Ingersoll gsing...@apache.org wrote:
 
 Anyone interested in thinking about 0.9 in the early Nov. time frame?
 
 -Grant


Grant Ingersoll | @gsingers
http://www.lucidworks.com







0.9?

2013-09-27 Thread Grant Ingersoll
Anyone interested in thinking about 0.9 in the early Nov. time frame?

-Grant


word2vec

2013-09-26 Thread Grant Ingersoll
Anyone looked at: https://code.google.com/p/word2vec/

-Grant


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: Hangout on Monday

2013-08-07 Thread Grant Ingersoll

On Aug 5, 2013, at 9:30 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Mon, Aug 5, 2013 at 5:21 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 Grant had setup a biweekly/weekly Google Doodle for Mahout meetups.
 
 
 Can you say something about how to hijack some of those?

Doodle just allows you to capture when people are most available.  I believe if 
you check the link sent earlier, you can see when they are and how people 
voted.  Otherwise, I can dig it up.

-Grant

Re: lucene.vectors tool not working

2013-07-31 Thread Grant Ingersoll
Can you provide more details on what you ran?

Also, please ask on u...@mahout.apache.org in the future

Thanks,
Grant
On Jul 31, 2013, at 9:18 PM, Swami Kevala swami.kev...@ishafoundation.org 
wrote:

 I'm using Solr 4.4 and Mahout 0.8
 
 I'm getting the following error
 
 SEVERE: There are too many documents that do not have a term vector for text
 Exception in thread "main" java.lang.IllegalStateException: There are too many 
 documents that do not have a term vector for text
     at org.apache.mahout.utils.vectors.lucene.AbstractLuceneIterator.computeNext(AbstractLuceneIterator.java:97)
 
 I tried setting the parameter:  --maxPercentErrorDocs 0.9 and I still get the 
 same error.
 
 I have defined termvectors for my Solr 'text' field
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-07-30 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724029#comment-13724029
 ] 

Grant Ingersoll commented on MAHOUT-627:


Dhruv,

Any chance this can get done?

 Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
 -

 Key: MAHOUT-627
 URL: https://issues.apache.org/jira/browse/MAHOUT-627
 Project: Mahout
  Issue Type: Task
  Components: Classification
Affects Versions: 0.4, 0.5
Reporter: Dhruv Kumar
Assignee: Grant Ingersoll
  Labels: gsoc, gsoc2011, mahout-gsoc-11
 Fix For: 0.9

 Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, 
 MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
 MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
 MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch


 Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
 Model Training. 
 Student Name: Dhruv Kumar 
 Student E-mail: dku...@ecs.umass.edu 
 Organization/Project: Apache Mahout 
 Assigned Mentor: 
 Proposal Abstract: 
 The Baum-Welch algorithm is commonly used for training a Hidden Markov Model 
 because of its superior numerical stability and its ability to guarantee the 
 discovery of a locally maximum,  Maximum Likelihood Estimator, in the 
 presence of incomplete training data. Currently, Apache Mahout has a 
 sequential implementation of the Baum-Welch which cannot be scaled to train 
 over large data sets. This restriction reduces the quality of training and 
 constrains generalization of the learned model when used for prediction. This 
 project proposes to extend Mahout's Baum-Welch to a parallel, distributed 
 version using the Map-Reduce programming framework for enhanced model fitting 
 over large data sets. 
 Detailed Description: 
 Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
 for applications generating temporal or spatial sequential data. Relative 
 simplicity of implementation, combined with their ability to discover latent 
 domain knowledge have made them very popular in diverse fields such as DNA 
 sequence alignment, gene discovery, handwriting analysis, voice recognition, 
 computer vision, language translation and parts-of-speech tagging. 
 A HMM is defined as a tuple (S, O, Theta) where S is a finite set of 
 unobservable, hidden states emitting symbols from a finite observable 
 vocabulary set O according to a probabilistic model Theta. The parameters of 
 the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
 transition matrix of the hidden states of size |S| X |S|. The elements 
 a_(i,j) of A specify the probability of transitioning from a state i to state 
 j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose 
 elements b_(s, o) provide the probability that a symbol o will be emitted 
 from the hidden state s. The elements pi_(s) of the |S| length vector Pi 
 determine the probability that the system starts in the hidden state s. The 
 transitions of hidden states are unobservable and follow the Markov property 
 of memorylessness. 
 Rabiner [1] defined three main problems for HMMs: 
 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the 
 observation sequence, determine the probability that the model generated the 
 observed sequence. This is useful for evaluating the quality of the model and 
 is solved using the so called Forward algorithm. 
 2. Decoding: Given the complete model (S, O, Theta) and an observation 
 sequence, determine the hidden state sequence which generated the observed 
 sequence. This can be viewed as an inference problem where the model and 
 observed sequence are used to predict the value of the unobservable random 
 variables. The Viterbi decoding algorithm is used for predicting the hidden 
 state sequence. 
 3. Training: Given the set of hidden states S, the set of observation 
 vocabulary O and the observation sequence, determine the parameters (A, B, 
 Pi) of the model Theta. This problem can be viewed as a statistical machine 
 learning problem of model fitting to a large set of training data. The 
 Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and 
 the Viterbi training algorithm are commonly used for model fitting. 
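
 To make the Evaluation problem concrete, here is a minimal Python sketch of 
 the Forward algorithm using the (A, B, Pi) notation defined above. It is 
 not Mahout's implementation: the function name, the list-of-lists matrix 
 layout and the toy parameter values are assumptions chosen for 
 illustration, and no scaling is applied, so it is only suitable for short 
 sequences.

```python
def forward_likelihood(A, B, pi, observations):
    """Probability that the HMM (A, B, pi) generated the observation sequence.

    A[i][j]: probability of moving from hidden state i to hidden state j
    B[s][o]: probability that hidden state s emits symbol o
    pi[s]:   probability that the system starts in hidden state s
    observations: non-empty list of symbol indices
    """
    n_states = len(pi)
    # Initialisation: alpha[s] = pi[s] * B[s][first symbol]
    alpha = [pi[s] * B[s][observations[0]] for s in range(n_states)]
    # Induction: propagate through A, then weight by the emission probability.
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    # Termination: sum over all possible final hidden states.
    return sum(alpha)


# Toy model with |S| = 2 hidden states and |O| = 2 symbols (made-up numbers).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
likelihood = forward_likelihood(A, B, pi, [0, 1])
print(likelihood)
```

 Each step costs O(|S|^2), so a sequence of length T is evaluated in 
 O(T * |S|^2) time instead of enumerating all |S|^T hidden state paths.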
 In general, the quality of HMM training can be improved by employing large 
 training vectors but currently, Mahout only supports sequential versions of 
 HMM trainers which are incapable of scaling.  Among the Viterbi and the 
 Baum-Welch training methods, the Baum-Welch algorithm is superior, accurate, 
 and a better candidate for a parallel

Re: 0.8

2013-07-25 Thread Grant Ingersoll
 entropy stuff in org.apache.mahout.math.stats.entropy
 
 If you are interested in supporting 1 or more of these algorithms, please 
 make it known on dev@mahout.apache.org and via JIRA issues that fix and/or 
 improve them. Please also provide 
 supporting evidence as to their effectiveness for you in production.
 
 1.0 PLANS
 
 Our plans as a community are to focus 0.9 on cleanup of bugs and the 
 removal of the code mentioned above and then to follow with a 1.0 
 release soon thereafter, at which point the community is committing to 
 the support of the algorithms packaged in the 1.0 for at least two minor
 versions after their release. In the case of removal, we will deprecate
 the functionality in the 1.(x+1) minor release and remove it in the 
 1.(x+2) release. For instance, if feature X is to be removed after the 
 1.2 release, it will be deprecated in 1.3 and removed in 1.4.
 {quote}
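
 The deprecation rule quoted above is simple version arithmetic. A tiny 
 Python sketch, with a hypothetical function name and return layout, 
 assuming the removal decision is taken after minor release 1.x:

```python
def removal_schedule(decided_after_minor, major=1):
    """Return the releases in which a feature is deprecated and then removed.

    decided_after_minor: the x in "1.x", i.e. the minor release after which
    the removal was decided (hypothetical parameter name).
    """
    return {
        "deprecated_in": f"{major}.{decided_after_minor + 1}",
        "removed_in": f"{major}.{decided_after_minor + 2}",
    }


schedule = removal_schedule(2)
print(schedule)
```

 With the example from the text (a decision after the 1.2 release), this 
 yields deprecation in 1.3 and removal in 1.4.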
 
 [1] 
 http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?revision=1501110&view=markup
 [2] 
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.8%22
 
 
 
 
 
 From: Grant Ingersoll gsing...@apache.org
 To: dev@mahout.apache.org dev@mahout.apache.org 
 Sent: Wednesday, July 24, 2013 7:51 AM
 Subject: 0.8
 
 
 0.8 artifacts are pushed to the mirror location.  I will send an official 
 announcement tomorrow.
 
 In the meantime, please review the release notes at: 
 https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8
 
 The new features/fixes section is pretty weak.
 
 -Grant


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Apache Mahout 0.8 Released

2013-07-25 Thread Grant Ingersoll
The Apache Mahout PMC is pleased to announce the release of Mahout 0.8. 
Mahout's goal is to build scalable machine learning libraries focused 
primarily in the areas of collaborative filtering (recommenders), 
clustering and classification (known collectively as the 3Cs), as well as the 
necessary infrastructure to support those implementations including, but
not limited to, math packages for statistics, linear algebra and others
as well as Java primitive collections, local and distributed vector and
matrix classes and a variety of integrative code to work with popular 
packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache 
Cassandra and much more. The 0.8 release is mainly a clean up release in
preparation for an upcoming 1.0 release, but there are several 
significant new features, which are highlighted below.

To get started with Apache Mahout 0.8, download the release artifacts and 
signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the central 
Maven repository. 

In addition to the release highlights and artifacts, please pay attention to 
the section labelled FUTURE PLANS below for more information about upcoming 
releases of Mahout.

As with any release, we wish to thank all of the users and contributors 
to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for 
individual credits, as there are too many to list here.

GETTING STARTED

In the release package, the examples directory contains several working 
examples of the core 
functionality available in Mahout. These can be run via scripts in the 
examples/bin directory and will prompt you for more information to help you try 
things out. Most examples do not need a Hadoop cluster in 
order to run.

RELEASE HIGHLIGHTS

The highlights of the Apache Mahout 0.8 release include, but are not 
limited to the list below. For further information, see the included 
CHANGELOG file.

- Numerous performance improvements to Vector and Matrix 
implementations, APIs and their iterators (see also MAHOUT-1192, 
MAHOUT-1202)
- Numerous performance improvements to the recommender implementations 
(see also MAHOUT-1272, MAHOUT-1035, MAHOUT-1042, MAHOUT-1151, 
MAHOUT-1166, MAHOUT-1167, MAHOUT-1169, MAHOUT-1205, MAHOUT-1264)
- MAHOUT-1088: Support for biased item-based recommender
- MAHOUT-1089: SGD matrix factorization for rating prediction with user and 
item biases
- MAHOUT-1106: Support for SVD++
- MAHOUT-944: Support for converting one or more Lucene storage indexes 
to SequenceFiles as well as an upgrade of the supported Lucene version 
to Lucene 4.3.1.
- MAHOUT-1154 and friends: New streaming k-means implementation that offers 
on-line (and fast) clustering
- MAHOUT-833: Make conversion to SequenceFiles Map-Reduce, 'seqdirectory' can 
now be run as a MapReduce job.
- MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of 
vector to hash (indexes or values).
- MAHOUT-884: Matrix Concat utility, presently only concatenates two matrices.
- MAHOUT-1244: Upgraded to use Lucene 4.3
- MAHOUT-1187: Upgraded to CommonsLang3
- MAHOUT-916: Speedup the Mahout build by making tests run in parallel.
- The usual bug fixes. See JIRA [2] for more
information on the 0.8 release.

A total of 218 separate JIRA issues are addressed in this release.

CONTRIBUTING

Mahout is always looking for contributions focused on the 3Cs. If you are 
interested in contributing, please see our contribution page, 
https://cwiki.apache.org/MAHOUT/how-to-contribute.html, on the Mahout wiki or 
contact us via email at dev@mahout.apache.org.

FUTURE PLANS

0.9

As the project moves towards a 1.0 release, the community is working to 
clean up and/or remove parts of the code base that are under-supported 
or that underperform as well as to better focus the energy and 
contributions on key algorithms that are proven to scale in production 
and have seen wide-spread adoption. To this end, in the next release, 
the project is planning on removing support for the following algorithms
unless there is sustained support and improvement of them before the 
next release.

The algorithms to be removed are:
- From Clustering:
    Dirichlet
    MeanShift
    MinHash
    Eigencuts

- From Classification (both are sequential implementations):
    Winnow
    Perceptron

- Frequent Pattern Mining

- Collaborative Filtering:
    All recommenders in org.apache.mahout.cf.taste.impl.recommender.knn
    SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and 
    org.apache.mahout.cf.taste.impl.recommender.slopeone
    Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
    TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender

- Mahout Math:
    Lanczos in favour of SSVD
    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

If you are interested in supporting 1 or more of these algorithms, please make 
it known on dev@mahout.apache.org and via JIRA issues that fix and/or improve 
them. Please also provide 
supporting evidence as to their 

Re: 0.8

2013-07-25 Thread Grant Ingersoll

On Jul 25, 2013, at 11:08 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 What does it mean -- remove Mahout Math?

It is a high level bullet, see the items underneath.  Unfortunately, they don't 
translate to text format very well.

[jira] [Updated] (MAHOUT-1284) DummyRecordWriter's bug with reused Writables

2013-07-24 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1284:


Fix Version/s: (was: 0.8)
   (was: 0.7)
   0.9

 DummyRecordWriter's bug with reused Writables
 -

 Key: MAHOUT-1284
 URL: https://issues.apache.org/jira/browse/MAHOUT-1284
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.7, 0.8
Reporter: Maysam Yabandeh
Priority: Minor
  Labels: test
 Fix For: 0.9

 Attachments: MAHOUT-1284.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 It is a recommended practice to reuse the Writable objects. 
 DummyRecordWriter, which is used for testing in Mahout, however keeps the 
 same Writable instance in a map: next time that the user reuses the Writable 
 object, the internal map of DummyRecordWriter changes as well. This makes 
 DummyRecordWriter fail for testing the MapReduce jobs that reuse the 
 Writables.
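
 The aliasing problem described above is not specific to Java. The sketch 
 below reproduces it in Python as an analogy only: Text, AliasingWriter and 
 CopyingWriter are hypothetical stand-ins, not Mahout or Hadoop classes, and 
 copying on write is just one possible fix, not necessarily what the 
 attached patch does.

```python
import copy


class Text:
    """Stand-in for a mutable, reusable Hadoop Writable."""

    def __init__(self, value=""):
        self.value = value

    def set(self, value):
        self.value = value


class AliasingWriter:
    """Keeps references to the objects it is given (the buggy behaviour)."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))


class CopyingWriter(AliasingWriter):
    """Stores a copy of each key and value, so later reuse cannot alter it."""

    def write(self, key, value):
        self.records.append((copy.copy(key), copy.copy(value)))


def emit(writer):
    """Emit two records while reusing the same key and value objects."""
    key, value = Text(), Text()
    for k, v in [("a", "1"), ("b", "2")]:
        key.set(k)
        value.set(v)
        writer.write(key, value)
    return [(k.value, v.value) for k, v in writer.records]


aliased = emit(AliasingWriter())
copied = emit(CopyingWriter())
print(aliased)
print(copied)
```

 With the aliasing writer, every stored record ends up showing the last 
 values written, which is why tests of jobs that reuse their Writables see 
 corrupted output.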

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


0.8

2013-07-24 Thread Grant Ingersoll
0.8 artifacts are pushed to the mirror location.  I will send an official 
announcement tomorrow.

In the meantime, please review the release notes at: 
https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8

The new features/fixes section is pretty weak.

-Grant

Re: [VOTE] Release Mahout 0.8

2013-07-19 Thread Grant Ingersoll
This passes.  I will finish off the release either tonight or tomorrow AM.

On Jul 19, 2013, at 3:06 AM, Jake Mannix jake.man...@gmail.com wrote:

 +1 from me, I used the jars to run some LDA (on a couple hundred million
 documents) on the work cluster (1.0.something small), and it worked fine.
 Other clustering example (with reuters) also worked as expected.
 
 
 
 On Thu, Jul 18, 2013 at 11:27 AM, Suneel Marthi 
 suneel_mar...@yahoo.com wrote:
 
 +1 from me.
 
 
 
 
 
 From: Sebastian Schelter s...@apache.org
 To: dev@mahout.apache.org
 Sent: Thursday, July 18, 2013 1:22 PM
 Subject: Re: [VOTE] Release Mahout 0.8
 
 
 +1 from me, recommender stuff worked fine in my tests
 
 
 2013/7/18 Grant Ingersoll gsing...@apache.org
 
 +1 from me.
 
 On Jul 16, 2013, at 4:52 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Applying a forcing function:
 
 Please vote on releasing the 0.8 artifacts at
 
 https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/
 .
 
 Release notes are at
 https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8
 
 [] +1 Looks good
 [] 0 - No opinion
 [] -1 Don't release
 
 Vote criteria from https://www.apache.org/dev/release.html
 
 What are the ASF requirements on approving a release?
 Votes on whether a package is ready to be released use majority
 approval
 -- i.e., at least three PMC members must vote affirmatively for release,
 and there must be more positive than negative votes. Releases may not be
 vetoed. Before voting +1 PMC members are required to download the signed
 source code package, compile it as provided, and test the resulting
 executable on their own platform, along with also verifying that the
 package meets the requirements of the ASF policy on releases.
 
 Thanks,
 Grant
 
 
 
 
 
 
 
 -- 
 
  -jake


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: [VOTE] Release Mahout 0.8

2013-07-18 Thread Grant Ingersoll
+1 from me.

On Jul 16, 2013, at 4:52 PM, Grant Ingersoll gsing...@apache.org wrote:

 Applying a forcing function:
 
 Please vote on releasing the 0.8 artifacts at 
 https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/.
   
 
 Release notes are at 
 https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8
 
 [] +1 Looks good
 [] 0 - No opinion
 [] -1 Don't release
 
 Vote criteria from https://www.apache.org/dev/release.html
 
 What are the ASF requirements on approving a release?
 Votes on whether a package is ready to be released use majority approval -- 
 i.e., at least three PMC members must vote affirmatively for release, and 
 there must be more positive than negative votes. Releases may not be vetoed. 
 Before voting +1 PMC members are required to download the signed source code 
 package, compile it as provided, and test the resulting executable on their 
 own platform, along with also verifying that the package meets the 
 requirements of the ASF policy on releases.
 
 Thanks,
 Grant




Re: mahout-distribution-0.8-src.tar.gz cannot be unpacked on Linux

2013-07-18 Thread Grant Ingersoll
If the artifacts don't work, this is a blocker.

On Jul 18, 2013, at 2:27 AM, Stevo Slavić ssla...@gmail.com wrote:

 Hello team,
 
 Just like binary distribution couldn't be unpacked (see
 MAHOUT-1229 <https://issues.apache.org/jira/browse/MAHOUT-1229>),
 I've just discovered that mahout-distribution-0.8-src.tar.gz also cannot be
 unpacked (mahout executable cannot be unpacked to bin directory, bin
 directory permissions are not set). Zip distribution src archive can be
 unpacked.
 
 Fix is trivial, equivalent to the fix for MAHOUT-1229.
 
 Shall we just fix this in 0.9 or release new 0.8 RC with this fixed?
 
 Kind regards,
 Stevo Slavic.




Re: mahout-distribution-0.8-src.tar.gz cannot be unpacked on Linux

2013-07-18 Thread Grant Ingersoll

On Jul 18, 2013, at 1:23 PM, Grant Ingersoll gsing...@apache.org wrote:

 If the artifacts don't work, this is a blocker.

On second thought, we could just document that piece.

 
 On Jul 18, 2013, at 2:27 AM, Stevo Slavić ssla...@gmail.com wrote:
 
 Hello team,
 
 Just like binary distribution couldn't be unpacked (see
 MAHOUT-1229 <https://issues.apache.org/jira/browse/MAHOUT-1229>),
 I've just discovered that mahout-distribution-0.8-src.tar.gz also cannot be
 unpacked (mahout executable cannot be unpacked to bin directory, bin
 directory permissions are not set). Zip distribution src archive can be
 unpacked.
 
 Fix is trivial, equivalent to the fix for MAHOUT-1229.
 
 Shall we just fix this in 0.9 or release new 0.8 RC with this fixed?
 
 Kind regards,
 Stevo Slavic.
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[VOTE] Release Mahout 0.8

2013-07-16 Thread Grant Ingersoll
Applying a forcing function:

Please vote on releasing the 0.8 artifacts at 
https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/.
  

Release notes are at 
https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8

[] +1 Looks good
[] 0 - No opinion
[] -1 Don't release

Vote criteria from https://www.apache.org/dev/release.html

What are the ASF requirements on approving a release?
Votes on whether a package is ready to be released use majority approval -- 
i.e., at least three PMC members must vote affirmatively for release, and there 
must be more positive than negative votes. Releases may not be vetoed. Before 
voting +1 PMC members are required to download the signed source code package, 
compile it as provided, and test the resulting executable on their own 
platform, along with also verifying that the package meets the requirements of 
the ASF policy on releases.

Thanks,
Grant
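
The majority-approval rule quoted in the vote criteria above can be sketched as a small check. This is only an illustration: the class and method names are hypothetical, and it assumes the caller has already counted binding (PMC) votes and set aside the "0 - No opinion" votes.

```java
/** Hypothetical helper illustrating the majority-approval rule quoted above. */
final class ReleaseVote {
    private ReleaseVote() {}

    /**
     * @param bindingPlusOnes  number of +1 votes cast by PMC members
     * @param bindingMinusOnes number of -1 votes cast by PMC members
     * @return true when there are at least three +1 votes and more +1 than -1 votes
     */
    static boolean passes(int bindingPlusOnes, int bindingMinusOnes) {
        if (bindingPlusOnes < 0 || bindingMinusOnes < 0) {
            throw new IllegalArgumentException("vote counts must be non-negative");
        }
        // A -1 is counted like any other vote; it is not a veto.
        return bindingPlusOnes >= 3 && bindingPlusOnes > bindingMinusOnes;
    }

    public static void main(String[] args) {
        System.out.println(passes(4, 0)); // four +1, no -1
        System.out.println(passes(2, 0)); // too few +1 votes
        System.out.println(passes(3, 3)); // not more positive than negative
    }
}
```

Note that a single -1 does not block the release under this rule; it only shifts the count.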

Re: Mahout release process

2013-07-15 Thread Grant Ingersoll

On Jul 14, 2013, at 7:27 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 I'd say go for it.  Of course, my preference would be that time spent on
 Mahout right now is focused on testing 0.8, but you are free to do as you
 wish.
 
 
 It looks good on my part. I did find, however, that a bug was (re-?)introduced
 into the UpperTriangular matrix (it breaks the row count property for a certain
 form of constructor), though it does not seem to affect any of the existing
 solvers. This is fixed as part of M-1281.

Do we need to respin?




Re: Mahout release process

2013-07-14 Thread Grant Ingersoll

On Jul 11, 2013, at 12:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Grant, so we have released then and can commit 0.9 issues to trunk now? or
 we are still frozen and waiting for final release steps? or release
 candidates?

I think you can, but the big unknown to me is how Maven handles rollbacks if 
something goes wrong.  I guess I can always pull the tag/branch and work off of 
that.


 
 because it is my understanding that after we have built 0.9 artifacts, we
 cannot build them again -- so we must have built final 0.9 then. If for
 some reason we are not happy with 0.9 artifacts we kind of have to build
 something like 0.9.1 but not 0.9 again...
 
 anyway i just want to know when it is ok to start pushing 0.9 things to
 master.

I'd say go for it.  Of course, my preference would be that time spent on Mahout 
right now is focused on testing 0.8, but you are free to do as you wish.


 
 Thank you, sir.
 
 -d
 
 
 On Thu, Jul 11, 2013 at 7:31 AM, Grant Ingersoll gsing...@apache.org wrote:
 
 
 On Jul 10, 2013, at 5:05 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 
 i thought maven release:prepare changes from 0.8-SNAPSHOT to 0.8 (and
 eliminates snapshot dependencies). and release:perform goes from 0.8 to
 0.9-SNAPSHOT. I.e. it guarantees that by the time you have 0.9-SNAPSHOT
 set, you also have a released 0.8 build.
 
 Correct.  The release artifacts are 0.8, no SNAPSHOT, trunk is 0.9-SNAPSHOT
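
The version transitions discussed here (0.8-SNAPSHOT to 0.8, then on to 0.9-SNAPSHOT) can be sketched with plain string handling. This is a hypothetical illustration, not the Maven release plugin's code: it assumes the next development version is obtained by incrementing the last numeric component, and it does not model which release goal performs which step.

```java
/** Hypothetical sketch of the snapshot -> release -> next-snapshot version steps. */
final class SnapshotVersions {
    private static final String SUFFIX = "-SNAPSHOT";

    private SnapshotVersions() {}

    /** "0.8-SNAPSHOT" -> "0.8" */
    static String releaseVersion(String snapshotVersion) {
        if (!snapshotVersion.endsWith(SUFFIX)) {
            throw new IllegalArgumentException("not a snapshot version: " + snapshotVersion);
        }
        return snapshotVersion.substring(0, snapshotVersion.length() - SUFFIX.length());
    }

    /** "0.8" -> "0.9-SNAPSHOT"; assumes the last component is a plain integer. */
    static String nextDevelopmentVersion(String releaseVersion) {
        int dot = releaseVersion.lastIndexOf('.');
        String prefix = releaseVersion.substring(0, dot + 1); // empty when there is no dot
        int last = Integer.parseInt(releaseVersion.substring(dot + 1));
        return prefix + (last + 1) + SUFFIX;
    }

    public static void main(String[] args) {
        String release = releaseVersion("0.8-SNAPSHOT");
        System.out.println(release);
        System.out.println(nextDevelopmentVersion(release));
        // With three components, as suggested elsewhere in this thread:
        System.out.println(nextDevelopmentVersion("0.8.0"));
    }
}
```

Under the same assumption, a three-component scheme such as 0.8.0 leaves room for a 0.8.1-SNAPSHOT maintenance line while trunk moves on independently.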
 
 
 but for some reason it is not what is happening now on trunk.
 
 
 On Wed, Jul 10, 2013 at 10:06 AM, Jake Mannix jake.man...@gmail.com
 wrote:
 
 On Wed, Jul 10, 2013 at 10:00 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 That's how the maven release plugin does it in my experience, and yes
 that's what I get now too.
 
 
 Ok, that's fine if it's intended, but it seems to put us in a little bit of
 a weird state.  We tell our users often to build on trunk, so if they're
 using the current most recent release (0.7), then if they do that now, they
 go from 0.7 to 0.9-SNAPSHOT.  Not the end of the world, but this would be
 avoided if we were on a release branch, right?
 
 Maybe next time, we can do that?
 
 
 
 
 On Wed, Jul 10, 2013 at 10:54 AM, Jake Mannix jake.man...@gmail.com
 wrote:
 
 So quick question: is an intentional side-effect of the current
 release
 process that when we build on trunk now, we build artifacts named e.g.
 mahout-examples-0.9-SNAPSHOT-job.jar ?
 
 
 On Wed, Jul 10, 2013 at 2:33 AM, Sean Owen sro...@gmail.com wrote:
 
 Yes you can do all of this in a branch, which would let things
 continue to change on HEAD. Otherwise HEAD has to be frozen. I think
 here there's not enough velocity of change to make freezing HEAD that
 big of a deal, but yes you could manage the process yourself in a
 branch if you wanted to.
 
 Tags are changeable in SVN. Nobody is depending on the tag until
 after
 the release is finalized, so moving them during the release or
 reapplying them is no big thing.
 
 The release process doesn't update Maven artifacts, even snapshots,
 so
 the process does not affect what artifacts end users use.
 
 RCs are indeed all labeled x.y but are certainly distinguished by
 date, timestamp. It's not a RC in the sense that it may evolve and
 change in response to bug fixes over weeks or months -- it's either a
 valid build or it isn't right now, and is released or not in a few
 days unless there is a critical build problem. It will only be
 developers that might ever distinguish several builds.
 
 You can use x.y.z for sure and I personally would be happy to see
 0.8.0 used instead of 0.8. That is technically more standard
 Maven
 convention. I don't think there will be enough change / energy for
 point releases but it doesn't hurt to allow for the possibility.
 
 
 On Wed, Jul 10, 2013 at 10:11 AM, Stevo Slavić ssla...@gmail.com
 wrote:
 This is continuation of my and Grant's discussion on
 https://issues.apache.org/jira/browse/MAHOUT-1275 which I believe
 is
 better
 suited to be continued here on the dev mailing list.
 
 Apologies for my ignorance, if this discussion took place earlier
 in
 the
 project lifetime.
 
 
 There is no 0.8 branch here:
 http://svn.apache.org/viewvc/mahout/branches/
 maven-release-plugin:prepare creates a tag only, which (in svn)
 although
 similar to branch, shouldn't be modifiable, and we need it to be
 modifiable
 if something needs to be changed for final 0.8 release, without
 stopping/freezing 0.9 development.
 Release instructions basically state that if something is wrong
 with
 RC
 release, to delete RC release (drop staging repo and delete tag
 from
 svn),
 rollback version changes on trunk (from 0.9-SNAPSHOT back to
 0.8-SNAPSHOT),
 make a fix on trunk, and prepare/perform RC release again (same 0.8
 release
 version).
 Current 0.8 RC, IMO is not a proper RC - if we need to make a
 change
 to
 it
 and release another RC, there would be no obvious distinction
 between
 the
 two RCs, especially to Maven builds

Re: Mahout release process

2013-07-14 Thread Grant Ingersoll
I made a branch off of 0.8, so presumably any fixes can be made off of that and 
then we can retag as necessary.


On Jul 14, 2013, at 7:29 AM, Grant Ingersoll gsing...@apache.org wrote:

 
 On Jul 11, 2013, at 12:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 
 Grant, so we have released then and can commit 0.9 issues to trunk now? or
 we are still frozen and waiting for final release steps? or release
 candidates?
 
 I think you can, but the big unknown to me is how Maven handles rollbacks if 
 something goes wrong.  I guess I can always pull the tag/branch and work off 
 of that.
 
 
 
 because it is my understanding that after we have built 0.9 artifacts, we
 cannot build them again -- so we must have built final 0.9 then. If for
 some reason we are not happy with 0.9 artifacts we kind of have to build
 something like 0.9.1 but not 0.9 again...
 
 anyway i just want to know when it is ok to start pushing 0.9 things to
 master.
 
 I'd say go for it.  Of course, my preference would be that time spent on 
 Mahout right now is focused on testing 0.8, but you are free to do as you 
 wish.
 
 
 
 Thank you, sir.
 
 -d
 
 
 On Thu, Jul 11, 2013 at 7:31 AM, Grant Ingersoll gsing...@apache.org wrote:
 
 
 On Jul 10, 2013, at 5:05 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 
 i thought maven release:prepare changes from 0.8-SNAPSHOT to 0.8 (and
 eliminates snapshot dependencies). and release:perform goes from 0.8 to
 0.9-SNAPSHOT. I.e. it guarantees that by the time you have 0.9-SNAPSHOT
 set, you also have a released 0.8 build.
 
 Correct.  The release artifacts are 0.8, no SNAPSHOT, trunk is 0.9-SNAPSHOT
 
 
 but for some reason it is not what is happening now on trunk.
 
 
 On Wed, Jul 10, 2013 at 10:06 AM, Jake Mannix jake.man...@gmail.com
 wrote:
 
 On Wed, Jul 10, 2013 at 10:00 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 That's how the maven release plugin does it in my experience, and yes
 that's what I get now too.
 
 
 Ok, that's fine if it's intended, but it seems to put us in a little bit of
 a weird state.  We tell our users often to build on trunk, so if they're
 using the current most recent release (0.7), then if they do that now, they
 go from 0.7 to 0.9-SNAPSHOT.  Not the end of the world, but this would be
 avoided if we were on a release branch, right?
 
 Maybe next time, we can do that?
 
 
 
 
 On Wed, Jul 10, 2013 at 10:54 AM, Jake Mannix jake.man...@gmail.com
 wrote:
 
 So quick question: is an intentional side-effect of the current
 release
 process that when we build on trunk now, we build artifacts named e.g.
 mahout-examples-0.9-SNAPSHOT-job.jar ?
 
 
 On Wed, Jul 10, 2013 at 2:33 AM, Sean Owen sro...@gmail.com wrote:
 
 Yes you can do all of this in a branch, which would let things
 continue to change on HEAD. Otherwise HEAD has to be frozen. I think
 here there's not enough velocity of change to make freezing HEAD that
 big of a deal, but yes you could manage the process yourself in a
 branch if you wanted to.
 
 Tags are changeable in SVN. Nobody is depending on the tag until
 after
 the release is finalized, so moving them during the release or
 reapplying them is no big thing.
 
 The release process doesn't update Maven artifacts, even snapshots,
 so
 the process does not affect what artifacts end users use.
 
 RCs are indeed all labeled x.y but are certainly distinguished by
 date, timestamp. It's not a RC in the sense that it may evolve and
 change in response to bug fixes over weeks or months -- it's either a
 valid build or it isn't right now, and is released or not in a few
 days unless there is a critical build problem. It will only be
 developers that might ever distinguish several builds.
 
 You can use x.y.z for sure and I personally would be happy to see
 0.8.0 used instead of 0.8. That is technically more standard
 Maven
 convention. I don't think there will be enough change / energy for
 point releases but it doesn't hurt to allow for the possibility.
 
 
 On Wed, Jul 10, 2013 at 10:11 AM, Stevo Slavić ssla...@gmail.com
 wrote:
 This is continuation of my and Grant's discussion on
 https://issues.apache.org/jira/browse/MAHOUT-1275 which I believe
 is
 better
 suited to be continued here on the dev mailing list.
 
 Apologies for my ignorance, if this discussion took place earlier
 in
 the
 project lifetime.
 
 
 There is no 0.8 branch here:
 http://svn.apache.org/viewvc/mahout/branches/
 maven-release-plugin:prepare creates a tag only, which (in svn)
 although
 similar to branch, shouldn't be modifiable, and we need it to be
 modifiable
 if something needs to be changed for final 0.8 release, without
 stopping/freezing 0.9 development.
 Release instructions basically state that if something is wrong
 with
 RC
 release, to delete RC release (drop staging repo and delete tag
 from
 svn),
 rollback version changes on trunk (from 0.9-SNAPSHOT back to
 0.8-SNAPSHOT),
 make a fix on trunk, and prepare/perform RC release again (same 0.8
 release

Re: Mahout release process

2013-07-11 Thread Grant Ingersoll
.x after final release could have 0.8.1-SNAPSHOT
 version,
 for any critical support changes in future, before 0.9 release.
 During whole time of forging 0.8 RC and final releases on their own
 0.8.x
 branch, 0.9-SNAPSHOT development on trunk can go on. Also, there
 would
 be
 no rollbacks necessary for RC releases (with exception of cases
 when
 it's
 really necessary, e.g. when release of some RC is incomplete/breaks
 because
 of network failure or something similar). Also tags stay
 non-modifiable.
 
 I noticed at least one Apache project to follow this release
 workflow
 (with
 staging RCs with different Maven coordinates, and promoting an RC
 to
 final
 release), namely on Apache HttpComponents project.
 
 I could understand current release process, if idea is to have all
 hands
 focused on the release while it's being made/tested, and also
 making
 it
 obvious to community (with absence of branches other than trunk)
 that
 there
 is no support whatsoever possible/available via minor releases,
 apart
 from
 changes on trunk and next major release.
 
 Kind regards,
 Stevo Slavić.
 
 
 
 
 --
 
  -jake
 
 
 
 
 
 --
 
  -jake
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: EigenDecomposition

2013-07-11 Thread Grant Ingersoll
FWIW, the only way we are getting out of code freeze is if we actually get some 
feedback on the RC.  It passes my tests, but I haven't heard from others much.

-Grant

On Jul 10, 2013, at 5:13 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 meant, after code freeze is over.
 
 
 On Wed, Jul 10, 2013 at 2:13 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 
 fixed as part of MAHOUT-1281 patch now. I will push after code freeze.
 
 
 On Wed, Jul 10, 2013 at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Please file.  Looks completely innocuous and it is good to be standard.
 
 
 On Wed, Jul 10, 2013 at 12:59 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:
 
 Looks like Lanczos is having the same problem and need to undo some
 workarounds :
 
EigenDecomposition decomp = new EigenDecomposition(triDiag);
 
Matrix eigenVects = decomp.getV();
Vector eigenVals = decomp.getRealEigenvalues();
endTime(TimingSection.TRIDIAG_DECOMP);
startTime(TimingSection.FINAL_EIGEN_CREATE);
for (int row = 0; row < i; row++) {
  Vector realEigen = null;
  // the eigenvectors live as columns of V, in reverse order.  Weird but true.
  Vector ejCol = eigenVects.viewColumn(i - row - 1);
  int size = Math.min(ejCol.size(), state.getBasisSize());
 
 
 
 On Wed, Jul 10, 2013 at 12:53 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:
 
 changing line 329 of EigenDecomposition.java from
 
if (d.getQuick(j) < p) {
 
 to
if (d.getQuick(j) > p) {
 
 
 makes my MAHOUT-1281 patch work.
 
 should i keep the change? (question for Ted, i guess)
 
 thanks.
 -D
 
 
 
 
 On Wed, Jul 10, 2013 at 11:59 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:
 
 It looks like values out of our ported EigenDecomposition are coming
 out
 sorted in inverse order.
 
 Shouldn't it be the other way around?
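
The ordering question in this thread comes down to the direction of one comparison in a selection-sort pass over the eigenvalues. The sketch below uses plain arrays rather than Mahout's Vector and Matrix types, and its names are hypothetical; it is not the actual EigenDecomposition code, only an illustration of how flipping the comparison flips the output order while keeping eigenvector columns aligned with their eigenvalues.

```java
import java.util.Arrays;

/** Hypothetical sketch: co-sorting eigenvalues and eigenvector columns. */
final class EigenSortSketch {
    private EigenSortSketch() {}

    /**
     * Sorts eigenvalues d in place and permutes the columns of v the same way.
     *
     * @param d          eigenvalues
     * @param v          matrix whose column j is the eigenvector for d[j]
     * @param descending true for largest-first order, false for smallest-first
     */
    static void sort(double[] d, double[][] v, boolean descending) {
        int n = d.length;
        for (int i = 0; i < n - 1; i++) {
            int k = i;
            double p = d[i];
            for (int j = i + 1; j < n; j++) {
                // This single comparison decides the final ordering.
                boolean better = descending ? d[j] > p : d[j] < p;
                if (better) {
                    k = j;
                    p = d[j];
                }
            }
            if (k != i) {
                d[k] = d[i];
                d[i] = p;
                // Swap columns i and k so eigenvectors follow their eigenvalues.
                for (double[] row : v) {
                    double t = row[i];
                    row[i] = row[k];
                    row[k] = t;
                }
            }
        }
    }

    public static void main(String[] args) {
        double[] d = {2.0, 5.0, 1.0};
        double[][] v = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
        sort(d, v, true);
        System.out.println(Arrays.toString(d));
        System.out.println(Arrays.deepToString(v));
    }
}
```

A caller that expects the opposite order has to index columns from the far end (as in the `i - row - 1` workaround quoted in this thread), which is why agreeing on one convention removes the need for such workarounds.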
 
 
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Commented] (MAHOUT-1275) Drop some of the Release Artifact File Types

2013-07-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13703155#comment-13703155
 ] 

Grant Ingersoll commented on MAHOUT-1275:
-

[~sslavic]  Yeah, Maven release does create the branch and that is the workflow 
I usually use as well.  The main issue I have, is it seems like the Maven 
release goal has to rollback things if for some reason there are issues w/ the 
RC, but perhaps that is just our misunderstanding of how to use the Maven 
release goal.  Please have a look at 
https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Release to see if our 
understanding of that is right.

 Drop some of the Release Artifact File Types
 

 Key: MAHOUT-1275
 URL: https://issues.apache.org/jira/browse/MAHOUT-1275
 Project: Mahout
  Issue Type: Task
Reporter: Grant Ingersoll
Assignee: Stevo Slavic
Priority: Minor
 Fix For: 0.9


 There really is no reason why we need so many release artifacts for the 
 distribution.  We run on *NIX machines.  Zip and Gzip are standard tools, 
 let's save a few bits, along with Release Manager upload times, and drop the 
 BZ2 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: AWS test bed

2013-07-09 Thread Grant Ingersoll
That sounds cool.  I think one of the keys is making it easy to spin up and 
test our stuff there.


On Jul 9, 2013, at 1:36 PM, Andrew Musselman andrew.mussel...@gmail.com wrote:

 Just got a promo code from the AWS team that will buy $1,000 of their
 services.
 
 
 On Tue, Jul 9, 2013 at 10:52 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 One of the things we chatted about last night in the hangout was how to
 automate this regression process.
 
 I reached out to our friends at Amazon Web Services, who are looking at
 how they could donate compute time so we could use a cluster as well as
 regressing on our own hosts.
 
 We could either spin things up and run things manually or write some
 scripts to do it; in any case I will keep you posted on what develops.
 
 Best
 Andrew
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: (Bi-)Weekly/Monthly Dev Sessions

2013-07-09 Thread Grant Ingersoll
No worries, next time!

It was a good first attempt, mainly focused around testing 0.8 and getting up 
to speed on what needs to be tested.

The next time I am available is August 5th.  If others want to meet before 
then, please do, otherwise, I will send out a reminder closer to the 5th.

On Jul 9, 2013, at 12:39 PM, Peng Cheng pc...@uowmail.edu.au wrote:

 Sorry I missed the meeting, I really want to listen to your discussion but 
 yesterday a thunderstorm cut off my electricity.
 
 On 13-07-08 08:29 PM, Andrew Musselman wrote:
 I'm getting an error when I build after doing svn up:
 
 $ mvn package
 [INFO] Scanning for projects...
 [ERROR] The build could not read 1 project - [Help 1]
 [ERROR]
 [ERROR]   The project  (/home/akm/mahout/pom.xml) has 1 error
 [ERROR] Non-readable POM /home/akm/mahout/pom.xml: no more data
 available - expected end tag </project> to close start tag <project> from
 line 2, parser stopped on END_TAG seen ...</reporting>\n</project>\n...
 @1030:1
 
 But there's a </project> tag at the end of that.
 
 
 On Mon, Jul 8, 2013 at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 Hmm, seems like that old link doesn't work.  Here's a new one:
 https://plus.google.com/hangouts/_/899b63ca1b3864c749886348cdddfcd80d00bb0b?hl=en
 
 -Grant
 
 On Jul 7, 2013, at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 How about tomorrow (Monday) night at 8:30 pm EDT?
 
 Anyone who wants to join, can browse to
 https://plus.google.com/hangouts/_/1aa32da8d1f9b1669cf6b5ec8bce123d12aec409?hl=en
  If for some reason that doesn't work, ping me on IRC (gsingers) in the
 #mahout channel on Freenode.
 
 Agenda:
 
 0.8 Release Testing
 
 -Grant
 
 
 On Jun 25, 2013, at 6:17 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 Is today's Hangout happening?
 
 
 On Wed, Jun 12, 2013 at 4:26 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 Hi,
 
 One of the things we kicked around at Buzzwords was having a
 weekly/bi-weekly/monthly dev session via Google hangout (Drill does
 this
 with good success, I believe).  Since we are so spread out, I thought
 I
 would throw out a Doodle (scheduling tool for those unfamiliar) to see
 what
 times work best for the majority of people interested in such a thing.
   Anyone is free to participate, but this is not a Q and A session,
 but is
 instead focused on writing code, fixing bugs, triaging JIRA,
 releasing,
 etc.
 If you are interested, please fill out
 http://doodle.com/gatxxkm7f25fq5y8 (note, all times are Eastern Time
 Zone
 since I did the poll!)  I just
 grabbed a sampling of hours throughout the day.  I also picked 1 week
 as
 being representative of this being on a repeating schedule.  If none
 of
 the
 times work for you, but you are still interested, please respond
 here.  I
 would imagine we would meet for 1-2 hours.
 
 Also, please reply with the frequency at which you would like to meet:
 
 []  Weekly
 []  Bi-weekly (every 2 weeks)
 []  Monthly
 
 My vote is every two weeks.
 
 -Grant
 
 
 
 --
 Thanks,
 Pradeep
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: 0.8 progress

2013-07-09 Thread Grant Ingersoll
Any feedback yet on the RCs?

-Grant

On Jul 8, 2013, at 1:51 PM, Peng Cheng pc...@uowmail.edu.au wrote:

 Hi Sebastian,
 
 I'm sorry for the entirely noobish questions: where can I download the 
 judging.txt ground truth set? (netflix is pulling it off everywhere, so far I 
 can only get the legacy trainingSet and qualifying.txt)
 and how do I inject the ParallelAlsFactorizationJob into a common recommender 
 class?
 I was trying to reproduce your result (I own a small cluster), but don't even 
 know where to start. The only related thing i found in mahout-example is a 
 format converter.
 
 Thanks a lot if you can give me a hint.
 
 - Yours Peng
 
 On 13-07-01 01:24 AM, Sebastian Schelter wrote:
 I successfully ran the ALS and cooccurrence-based recommenders on the
 Netflix dataset on a 26 machine cluster using Hadoop 1.0.4.
 
 --sebastian
 
 
 On 28.06.2013 21:31, Jake Mannix wrote:
 I can run LDA on Twitter's cluster, on both reuters and some real data,
 as well as LR/SGD.
 
 
 On Fri, Jun 28, 2013 at 11:51 AM, Grant Ingersoll 
 gsing...@apache.org wrote:
 
 We really should set up a VM on which we can run a couple of nodes (perhaps at
 ASF?) and share it w/ everyone, so that it is easy to test our stuff
 on Hadoop for the specific version that we ship.
 
 On Jun 28, 2013, at 2:41 PM, Robin Anil robin.a...@gmail.com wrote:
 
 Can someone (if you have the time and experience) write a small shim to run
 all examples one after the other on a cluster, and write up instructions on
 how to do it?
 
 Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
 
 
 On Fri, Jun 28, 2013 at 1:11 PM, Sebastian Schelter s...@apache.org
 wrote:
 It's crucial that we retest everything on a real cluster before the
 release.
 I will do this for the recommenders code next week.
 
 --sebastian
 Am 28.06.2013 14:03 schrieb Grant Ingersoll gsing...@apache.org:
 
 I should have time next week to do the release, if we can get these
 knocked out.  If not next week, the following.
 
 On Jun 28, 2013, at 5:46 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
 1. Could someone look at Mahout-1257? There is a patch that's been
 submitted but I am not sure if this has been superseded by Sean's
 against
 Mahout-1239.
 2. Stevo, I am for fixing the findbugs excludes as part of 0.8
 release,
 I see that the number of warnings has gone up over the last few builds.
 3. I am more concerned about the cause of the mysterious cosmic rays
 that randomly fail unit tests (since we have moved to running parallel
 tests).  I see that happening on my local repository too.
 
 
 
 
 From: Stevo Slavić ssla...@gmail.com
 To: dev@mahout.apache.org
 Sent: Friday, June 28, 2013 3:21 AM
 Subject: Re: 0.8 progress
 
 
 Well done team!
 
 Build is unstable, oscillates, IMO regardless of changes made. Judging
 from
 logs I suspect that some of the Jenkins nodes are not configured well,
 /tmp
 directory security related issues, and file size constraints. Could be
 also
 issue with our tests.
 
 Javadoc was reported earlier not to be OK (not all modules in
 aggregated
 javadoc), and code quality reports are not working OK, e.g. findbugs
 doesn't respect excludes - plan to work on this during weekend.
 
 Do we want to fix these before or after 0.8 release?
 
 Kind regards,
 Stevo Slavić.
 
 
 On Fri, Jun 28, 2013 at 12:32 AM, Robin Anil robin.a...@gmail.com
 wrote:
 All Done
 
 Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
 
 
 On Sun, Jun 23, 2013 at 11:36 PM, Robin Anil robin.a...@gmail.com
 wrote:
 I sent the comments. The code is good. But without the matrix/vector input
 we can't ship it in the release. Hope Yiqun and Da Zhang can make those
 changes quickly.
 
 
 Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
 
 
 On Sun, Jun 23, 2013 at 8:46 PM, Grant Ingersoll 
 gsing...@apache.org
 wrote:
 
 I see 1 issue left: MAHOUT-1214.  It is assigned to Robin.  Any
 chance
 we
 can finish this up this week?
 
 -Grant
 
 On Jun 23, 2013, at 9:26 AM, Suneel Marthi 
 suneel_mar...@yahoo.com
 wrote:
 
 Finally got to finishing up M-833, the changes can be reviewed at
 https://reviews.apache.org/r/11774/diff/3/.
 
 
 
 
 
 From: Grant Ingersoll gsing...@apache.org
 To: dev@mahout.apache.org
 Sent: Tuesday, June 11, 2013 10:09 AM
 Subject: Re: 0.8 progress
 
 
 I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by
 Thursday, I can roll an RC on Thursday.
 -Grant
 
 On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 Down to 4 issues!  I would say what they are, but JIRA is flaking
 out
 again.
 My instinct is that 1030 and 1233 can be pushed.  Suneel has been
 working hard to get M-833 in.  Not sure on M-1214, Robin?
 -G
 
 On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 On Jun 9, 2013, at 6:02 PM, Grant Ingersoll 
 gsing...@apache.org
 wrote:
 M-1067 -- Dmitriy  --  This is an enhancement, should we push

[jira] [Created] (MAHOUT-1275) Drop some of the Release Artifact File Types

2013-07-08 Thread Grant Ingersoll (JIRA)
Grant Ingersoll created MAHOUT-1275:
---

 Summary: Drop some of the Release Artifact File Types
 Key: MAHOUT-1275
 URL: https://issues.apache.org/jira/browse/MAHOUT-1275
 Project: Mahout
  Issue Type: Task
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.9


There really is no reason why we need so many release artifacts for the 
distribution.  We run on *NIX machines.  Zip and Gzip are standard tools, let's 
save a few bits, along with Release Manager upload times, and drop the BZ2 
format.



[jira] [Commented] (MAHOUT-1275) Drop some of the Release Artifact File Types

2013-07-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702370#comment-13702370
 ] 

Grant Ingersoll commented on MAHOUT-1275:
-

Stevo, just FYI, please don't commit anything right now, as we are under code 
freeze until 0.8 is out (unless you know how to deal w/ this in Maven release 
plugin)

 Drop some of the Release Artifact File Types
 

 Key: MAHOUT-1275
 URL: https://issues.apache.org/jira/browse/MAHOUT-1275
 Project: Mahout
  Issue Type: Task
Reporter: Grant Ingersoll
Assignee: Stevo Slavic
Priority: Minor
 Fix For: 0.9


 There really is no reason why we need so many release artifacts for the 
 distribution.  We run on *NIX machines.  Zip and Gzip are standard tools, 
 let's save a few bits, along with Release Manager upload times, and drop the 
 BZ2 format.



[jira] [Commented] (MAHOUT-1275) Drop some of the Release Artifact File Types

2013-07-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702458#comment-13702458
 ] 

Grant Ingersoll commented on MAHOUT-1275:
-

[~sslavic] Please revert this.  We are under code freeze right now on trunk.

 Drop some of the Release Artifact File Types
 

 Key: MAHOUT-1275
 URL: https://issues.apache.org/jira/browse/MAHOUT-1275
 Project: Mahout
  Issue Type: Task
Reporter: Grant Ingersoll
Assignee: Stevo Slavic
Priority: Minor
 Fix For: 0.9


 There really is no reason why we need so many release artifacts for the 
 distribution.  We run on *NIX machines.  Zip and Gzip are standard tools, 
 let's save a few bits, along with Release Manager upload times, and drop the 
 BZ2 format.



Re: (Bi-)Weekly/Monthly Dev Sessions

2013-07-08 Thread Grant Ingersoll
Hmm, seems like that old link doesn't work.  Here's a new one: 
https://plus.google.com/hangouts/_/899b63ca1b3864c749886348cdddfcd80d00bb0b?hl=en

-Grant

On Jul 7, 2013, at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote:

 How about tomorrow (Monday) night at 8:30 pm EDT?  
 
 Anyone who wants to join, can browse to 
 https://plus.google.com/hangouts/_/1aa32da8d1f9b1669cf6b5ec8bce123d12aec409?hl=en
   If for some reason that doesn't work, ping me on IRC (gsingers) in the 
 #mahout channel on Freenode.
 
 
 Agenda:
 
 0.8 Release Testing
 
 -Grant
 
 
 On Jun 25, 2013, at 6:17 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 Is today's Hangout happening?
 
 
 
 On Wed, Jun 12, 2013 at 4:26 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Hi,
 
 One of the things we kicked around at Buzzwords was having a
 weekly/bi-weekly/monthly dev session via Google hangout (Drill does this
 with good success, I believe).  Since we are so spread out, I thought I
 would throw out a Doodle (scheduling tool for those unfamiliar) to see
 what
 times work best for the majority of people interested in such a thing.
   Anyone is free to participate, but this is not a Q and A session, but is
 instead focused on writing code, fixing bugs, triaging JIRA, releasing,
 etc.
 
 If you are interested, please fill out
 http://doodle.com/gatxxkm7f25fq5y8 (note, all times are Eastern Time Zone
 since I did the poll!)  I just
 grabbed a sampling of hours throughout the day.  I also picked 1 week as
 being representative of this being on a repeating schedule.  If none of
 the
 times work for you, but you are still interested, please respond here.  I
 would imagine we would meet for 1-2 hours.
 
 Also, please reply with the frequency at which you would like to meet:
 
 []  Weekly
 []  Bi-weekly (every 2 weeks)
 []  Monthly
 
 My vote is every two weeks.
 
 -Grant
 
 
 
 
 --
 Thanks,
 Pradeep
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: Jenkins build is back to normal : Mahout-Quality #2128

2013-07-07 Thread Grant Ingersoll

On Jul 6, 2013, at 4:38 PM, Stevo Slavić ssla...@gmail.com wrote:

 What did the trick (as of r1500216) for last two builds to be successful
 was serializing unit tests. At least some of them it seems are not designed
 to run in parallel (they very likely share some state), and they were
 running in parallel (1.5 per CPU core of Jenkins node on which build is
 running), causing each other to fail randomly. Now it's all sequential.

So, we undid the parallel builds?  Do you have a sense of the ones that were 
causing problems?

-G
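
To make the failure mode Stevo describes concrete, here is a minimal sketch of how two tests that share state can pass when run one after the other yet fail when their steps interleave. This is not Mahout's test code: the file names and the `write_step`/`check_step`/`run_interleaved` helpers are hypothetical, and the "parallel" run is replayed as one fixed, unlucky schedule so the result is deterministic.

```python
import tempfile
from pathlib import Path


def write_step(path, tag):
    """First half of a test: write this test's output file."""
    path.write_text(tag)


def check_step(path, tag):
    """Second half of a test: verify the file still holds our output."""
    return path.read_text() == tag


def run_interleaved(path_a, path_b):
    """Replay one unlucky parallel schedule: A writes, B writes, then both check."""
    write_step(path_a, "A")
    write_step(path_b, "B")
    return check_step(path_a, "A"), check_step(path_b, "B")


with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    # Both tests use the same file: B overwrites A's output before A checks it.
    shared_result = run_interleaved(root / "out.txt", root / "out.txt")
    # Each test gets its own file: the same schedule is now harmless.
    isolated_result = run_interleaved(root / "a.txt", root / "b.txt")

print(shared_result)    # (False, True)
print(isolated_result)  # (True, True)
```

Running the tests sequentially avoids the first outcome; giving each test its own working directory is the other usual remedy, and it keeps parallel execution available.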

Re: Code Freeze for 0.8

2013-07-07 Thread Grant Ingersoll
Working on the release now.  If anyone wants to join in, I'm on IRC as well.

-Grant


On Jul 5, 2013, at 12:40 PM, Sebastian Schelter s...@apache.org wrote:

 +1
 
 On 05.07.2013 18:06, Jake Mannix wrote:
 +1
 
 
 
 On Fri, Jul 5, 2013 at 8:47 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 +1
 
 
 On Fri, Jul 5, 2013 at 7:43 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
 +1
 
 
 
 
 From: Grant Ingersoll gsing...@apache.org
 To: dev@mahout.apache.org dev@mahout.apache.org
 Sent: Friday, July 5, 2013 10:36 AM
 Subject: Code Freeze for 0.8
 
 
 I know it's short notice, but I'd like to suggest a code freeze for 0.8
 today or tomorrow and I will do a 0.8 RC on Sunday.  Based on JIRA, etc.,
 it looks like this should be fine, but let me know if there are any
 objections.
 
 Thanks,
 Grant
 
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: (Bi-)Weekly/Monthly Dev Sessions

2013-07-07 Thread Grant Ingersoll
How about tomorrow (Monday) night at 8:30 pm EDT?  

Anyone who wants to join, can browse to 
https://plus.google.com/hangouts/_/1aa32da8d1f9b1669cf6b5ec8bce123d12aec409?hl=en
  If for some reason that doesn't work, ping me on IRC (gsingers) in the 
#mahout channel on Freenode.


Agenda:

0.8 Release Testing

-Grant


On Jun 25, 2013, at 6:17 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Is today's Hangout happening?
 
 
 
 On Wed, Jun 12, 2013 at 4:26 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Hi,
 
 One of the things we kicked around at Buzzwords was having a
 weekly/bi-weekly/monthly dev session via Google hangout (Drill does this
 with good success, I believe).  Since we are so spread out, I thought I
 would throw out a Doodle (scheduling tool for those unfamiliar) to see
 what
 times work best for the majority of people interested in such a thing.
   Anyone is free to participate, but this is not a Q and A session, but is
 instead focused on writing code, fixing bugs, triaging JIRA, releasing,
 etc.
 
 If you are interested, please fill out
 http://doodle.com/gatxxkm7f25fq5y8 (note, all times are Eastern Time Zone
 since I did the poll!)  I just
 grabbed a sampling of hours throughout the day.  I also picked 1 week as
 being representative of this being on a repeating schedule.  If none of
 the
 times work for you, but you are still interested, please respond here.  I
 would imagine we would meet for 1-2 hours.
 
 Also, please reply with the frequency at which you would like to meet:
 
 []  Weekly
 []  Bi-weekly (every 2 weeks)
 []  Monthly
 
 My vote is every two weeks.
 
 -Grant
 
 
 
 
 --
 Thanks,
 Pradeep


Grant Ingersoll | @gsingers
http://www.lucidworks.com







0.8 Release Notes

2013-07-07 Thread Grant Ingersoll
Please add/edit/delete/extend: 
https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8

Artifacts for the RC are uploading as I type.

-Grant

Code Freeze for 0.8

2013-07-05 Thread Grant Ingersoll
I know it's short notice, but I'd like to suggest a code freeze for 0.8 today 
or tomorrow and I will do a 0.8 RC on Sunday.  Based on JIRA, etc., it looks 
like this should be fine, but let me know if there are any objections.

Thanks,
Grant

Re: In-Mapper combiner design pattern

2013-06-30 Thread Grant Ingersoll
Just  coming back to this...

On Jun 12, 2013, at 5:38 PM, DB Tsai dbt...@dbtsai.com wrote:

 Hi,
 
 For scalable SVM, since our codebase is quite different from mahout,
 it may take some time to refactor it to work in Mahout.

Note that the community may be able to help here: if you put up a patch, 
others will likely jump in and help.  Your call, of course.

Food for thought,
Grant

Re: 0.8 progress

2013-06-28 Thread Grant Ingersoll
I should have time next week to do the release, if we can get these knocked 
out.  If not next week, the following.

On Jun 28, 2013, at 5:46 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 1. Could someone look at Mahout-1257? There is a patch that's been submitted 
 but I am not sure if this has been superseded by Sean's against Mahout-1239.
 
 2. Stevo, I am for fixing the findbugs excludes as part of 0.8 release, I see 
 that the number of warnings has gone up over the last few builds. 
 
 3. I am more concerned about the cause of the mysterious cosmic rays that 
 randomly fail unit tests (since we have moved to running parallel tests).  I 
 see that happening on my local repository too.  
 
 
 
 
 
 From: Stevo Slavić ssla...@gmail.com
 To: dev@mahout.apache.org 
 Sent: Friday, June 28, 2013 3:21 AM
 Subject: Re: 0.8 progress
 
 
 Well done team!
 
 Build is unstable, oscillates, IMO regardless of changes made. Judging from
 logs I suspect that some of the Jenkins nodes are not configured well, /tmp
 directory security related issues, and file size constraints. Could be also
 issue with our tests.
 
 Javadoc was reported earlier not to be OK (not all modules in aggregated
 javadoc), and code quality reports are not working OK, e.g. findbugs
 doesn't respect excludes - plan to work on this during weekend.
 
 Do we want to fix these before or after 0.8 release?
 
 Kind regards,
 Stevo Slavić.
 
 
 On Fri, Jun 28, 2013 at 12:32 AM, Robin Anil robin.a...@gmail.com wrote:
 
 All Done
 
 Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
 
 
 On Sun, Jun 23, 2013 at 11:36 PM, Robin Anil robin.a...@gmail.com wrote:
 
 I sent the comments. The code is good. But without the matrix/vector
 input
 we can't ship it in the release. Hope Yiqun and Da Zhang can make those
 changes quickly.
 
 
 Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
 
 
 On Sun, Jun 23, 2013 at 8:46 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 I see 1 issue left: MAHOUT-1214.  It is assigned to Robin.  Any chance
 we
 can finish this up this week?
 
 -Grant
 
 On Jun 23, 2013, at 9:26 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
 Finally got to finishing up M-833, the changes can be reviewed at
 https://reviews.apache.org/r/11774/diff/3/.
 
 
 
 
 
 
 From: Grant Ingersoll gsing...@apache.org
 To: dev@mahout.apache.org
 Sent: Tuesday, June 11, 2013 10:09 AM
 Subject: Re: 0.8 progress
 
 
 I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by
 Thursday, I can roll an RC on Thursday.
 
 -Grant
 
 On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Down to 4 issues!  I would say what they are, but JIRA is flaking out
 again.
 
 My instinct is that 1030 and 1233 can be pushed.  Suneel has been
 working hard to get M-833 in.  Not sure on M-1214, Robin?
 
 -G
 
 On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 
 On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 M-1067 -- Dmitriy  --  This is an enhancement, should we push?
 
 Looks like this was committed already.
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: 0.8 progress

2013-06-28 Thread Grant Ingersoll
We really should set up a VM (perhaps at ASF?) on which we can run a couple of 
nodes and which we can share w/ everyone, so that it is easy to test our stuff 
on Hadoop for the specific version that we ship.

On Jun 28, 2013, at 2:41 PM, Robin Anil robin.a...@gmail.com wrote:

 Can someone (if you have time and experience) write a small shim to run
 all examples one after the other on a cluster, and write up instructions on
 how to do it?
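
As a starting point for such a shim, here is a minimal sketch that runs a list of commands one after the other and reports which ones failed. The `run_all` and `failed` helpers are hypothetical, and the commands in `demo` are stand-ins that only invoke the Python interpreter; the real example scripts and their arguments would have to be filled in by whoever knows the cluster setup.

```python
import subprocess
import sys


def run_all(commands):
    """Run each command in order and return a list of (command, exit_code)."""
    results = []
    for cmd in commands:
        completed = subprocess.run(cmd)
        results.append((cmd, completed.returncode))
    return results


def failed(results):
    """Return the commands that exited with a non-zero status."""
    return [cmd for cmd, code in results if code != 0]


# Stand-in commands (assumptions for illustration only); replace them with
# the actual example invocations for the cluster under test.
demo = [
    [sys.executable, "-c", "print('example one ok')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    [sys.executable, "-c", "print('example three ok')"],
]

demo_results = run_all(demo)
for cmd in failed(demo_results):
    print("FAILED:", " ".join(cmd))
```

The runner deliberately keeps going after a failure, so a single broken example does not hide the status of the ones that follow it.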
 
 Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
 
 
 On Fri, Jun 28, 2013 at 1:11 PM, Sebastian Schelter s...@apache.org wrote:
 
 It's crucial that we retest everything on a real cluster before the release.
 I will do this for the recommenders code next week.
 
 --sebastian
 Am 28.06.2013 14:03 schrieb Grant Ingersoll gsing...@apache.org:
 
 I should have time next week to do the release, if we can get these
 knocked out.  If not next week, the following.
 
 On Jun 28, 2013, at 5:46 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
 1. Could someone look at Mahout-1257? There is a patch that's been
 submitted but I am not sure if this has been superseded by Sean's against
 Mahout-1239.
 
 2. Stevo, I am for fixing the findbugs excludes as part of 0.8 release,
 I see that the number of warnings has gone up over the last few builds.
 
 3. I am more concerned about the cause of the mysterious cosmic rays
 that randomly fail unit tests (since we have moved to running parallel
 tests).  I see that happening on my local repository too.
 
 
 
 
 
 From: Stevo Slavić ssla...@gmail.com
 To: dev@mahout.apache.org
 Sent: Friday, June 28, 2013 3:21 AM
 Subject: Re: 0.8 progress
 
 
 Well done team!
 
 Build is unstable, oscillates, IMO regardless of changes made. Judging
 from
 logs I suspect that some of the Jenkins nodes are not configured well,
 /tmp
 directory security related issues, and file size constraints. Could be
 also
 issue with our tests.
 
 Javadoc was reported earlier not to be OK (not all modules in
 aggregated
 javadoc), and code quality reports are not working OK, e.g. findbugs
 doesn't respect excludes - plan to work on this during weekend.
 
 Do we want to fix these before or after 0.8 release?
 
 Kind regards,
 Stevo Slavić.
 
 
 On Fri, Jun 28, 2013 at 12:32 AM, Robin Anil robin.a...@gmail.com
 wrote:
 
 All Done
 
 Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
 
 
 On Sun, Jun 23, 2013 at 11:36 PM, Robin Anil robin.a...@gmail.com
 wrote:
 
 I sent the comments. The code is good. But without the matrix/vector
 input
 we can't ship it in the release. Hope Yiqun and Da Zhang can make
 those
 changes quickly.
 
 
 Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
 
 
 On Sun, Jun 23, 2013 at 8:46 PM, Grant Ingersoll 
 gsing...@apache.org
 wrote:
 
 I see 1 issue left: MAHOUT-1214.  It is assigned to Robin.  Any
 chance
 we
 can finish this up this week?
 
 -Grant
 
 On Jun 23, 2013, at 9:26 AM, Suneel Marthi suneel_mar...@yahoo.com
 
 wrote:
 
 Finally got to finishing up M-833, the changes can be reviewed at
 https://reviews.apache.org/r/11774/diff/3/.
 
 
 
 
 
 
 From: Grant Ingersoll gsing...@apache.org
 To: dev@mahout.apache.org
 Sent: Tuesday, June 11, 2013 10:09 AM
 Subject: Re: 0.8 progress
 
 
 I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by
 Thursday, I can roll an RC on Thursday.
 
 -Grant
 
 On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Down to 4 issues!  I would say what they are, but JIRA is flaking
 out
 again.
 
 My instinct is that 1030 and 1233 can be pushed.  Suneel has been
 working hard to get M-833 in.  Not sure on M-1214, Robin?
 
 -G
 
 On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 
 On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org
 
 wrote:
 
 M-1067 -- Dmitriy  --  This is an enhancement, should we push?
 
 Looks like this was committed already.
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691954#comment-13691954
 ] 

Grant Ingersoll commented on MAHOUT-1214:
-

Hi,

Any progress on this?  It is the last open issue for 0.8.

Thanks,
Grant


 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
Assignee: Robin Anil
  Labels: clustering, improvement
 Fix For: 0.8

 Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2


 The current implementation of the spectral KMeans algorithm (Andrew Ng et al.,
 NIPS 2002) in version 0.7 has two serious issues. These two incorrect
 implementations make it fail even for a very obvious trivial dataset. We have
 implemented a solution to resolve these two issues and hope to contribute
 it back to the community.
 # Issue 1:
 The EigenVerificationJob in version 0.7 does not check the orthogonality of
 eigenvectors, which is necessary to obtain the correct clustering results for
 the case of K>1. We have an idea and implementation to select based on
 cosAngle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal, and
 sometimes a bad initialization will generate a wrong clustering result. In this
 case, the selected K eigenvectors actually provide a better way to initialize
 cluster centroids because each selected eigenvector is a relaxed indicator of
 the memberships of one cluster. For every selected eigenvector, we use the
 data point whose eigen component achieves the maximum absolute value.
 We have already verified our improvement on a synthetic dataset, and it shows
 that the improved version gets the optimal clustering result while the current
 0.7 version obtains the wrong result.
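
The orthogonality check described under Issue 1 can be sketched roughly as follows. This is an illustration only, not the code in the attached patch: the `cos_angle` and `select_orthogonal` helpers and the `tol` threshold are assumptions, and the greedy "keep a vector only if it is orthogonal to everything kept so far" rule is one plausible reading of "select based on cosAngle/orthogonality".

```python
import math


def cos_angle(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_orthogonal(vectors, k, tol=1e-6):
    """Greedily keep up to k vectors that are mutually orthogonal within tol."""
    kept = []
    for v in vectors:
        if not any(v):
            continue  # skip zero vectors, whose angle is undefined
        if all(abs(cos_angle(v, u)) <= tol for u in kept):
            kept.append(v)
        if len(kept) == k:
            break
    return kept


candidates = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # parallel to the first vector, rejected
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],  # not orthogonal to the third vector, rejected
    [0.0, 0.0, 3.0],
]
selected = select_orthogonal(candidates, k=3)
print(selected)  # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
```

With eigenvectors produced by an approximate solver, `tol` would likely need to be much looser than the default used here.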



Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-24 Thread Grant Ingersoll
I'd really like to, but had a trip come up.  If possible, can we push it back 
one week?  Otherwise, if others want to go forward, I can try to set things up 
and share it w/ others.

On Jun 24, 2013, at 6:35 PM, Bhaskar Mookerji mooke...@spin-one.org wrote:

 Hi!
 
 Is the Google hangouts dev session tomorrow/Tuesday still happening?
 
 Lurkingly,
 Buro Mookerji
 
 
 On Fri, Jun 14, 2013 at 3:37 AM, Grant Ingersoll gsing...@apache.org wrote:
 
 It seems that 6 pm ET is the consensus time for the majority of
 people, although my having screwed up the poll didn't help.
 
 Bi-weekly is the other consensus.  It also looks like Tuesday or Thursday
 are the preferred dates.
 
 I can't make next week, so I'm going to propose we kick off on Tuesday,
 June 25 at 6 pm.  That will give us time to dry-run the Google Hangouts,
 etc.
 
 Again, just to be clear, the goal here is to work on the development of
 Mahout, not to answer questions about how to run Mahout (we could do that
 separately if there is a desire.)
 
 I'll send out a reminder as we get closer.
 
 -Grant
 
 
 On Jun 12, 2013, at 3:04 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
 I am from Northern Virginia, how many of us here are from the Washington
 DC Metro area?
 
 
 
 
 
 From: Jake Mannix jake.man...@gmail.com
 To: dev@mahout.apache.org dev@mahout.apache.org
 Sent: Wednesday, June 12, 2013 1:56 PM
 Subject: Re: (Bi-)Weekly/Monthly Dev Sessions
 
 
 Wow, a lot of Seattleites, I should organize a Mahout MeetUp / Hackathon
 when I get back from Europe at the end of the summer!
 
 
 On Wed, Jun 12, 2013 at 10:44 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Bi-weekly is good for me; I'm in Seattle and just filled out the poll.
 
 Great idea!
 
 
 On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com
 wrote:
 
 +1, am in Seattle as well and would love to attend and be involved.
 
 Sent from my iPhone
 
 On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com
 wrote:
 
 Good idea on recurring meetings. I'm very interested in participating.
 Biweekly works for me. I'm in Seattle (pacific) timezone - GMT-8.
 
 An agenda for the meetings ahead of time will help us get the most of
 our
 time at the meetings.
 
 Thanks.
 On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 
 On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu
 wrote:
 
 Angel and Suneel, you may want to re-fill out the new doodle.
 
 FYI, this week won't be representative of my schedule; I'm in the
 last
 few weeks of a job at ORNL where I travel every weekend. Normally
 I'll
 have
 more flexibility than just 6pm on weeknights.
 
 Yeah, Doodle makes you pick dates, but I just want it to be
 representative of a week-long period of time and not tied to a specific
 set of dates.  So, just put in what your ideal times are in general and
 ignore the fact that it is set to next week.
 
 
 On 6/12/13 8:26 AM, Grant Ingersoll wrote:
 On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu
 wrote:
 
 +1, awesome idea
 
 One question: the poll, while set to GMT -5, does say it's in
 Central
 Time. Is this a daylight savings thing?
 I turned on Time Zone support, so not sure how it will look to
 others,
 but it sounds like it adjusts based on your location...  I see: 8 am,
 10,
 1, so on.
 
 I also realize, that I messed it up.  I meant 9 pm, not 9 am.
 
 Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 
 
 
 
 
 --
 
  -jake
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: Build failed in Jenkins: mahout-nightly » Mahout Integration #1272

2013-06-24 Thread Grant Ingersoll
Can someone w/ more Hadoop experience look at this?  We are getting:

java.lang.ClassCastException: org.apache.mahout.text.LuceneSegmentInputSplit cannot be cast to org.apache.hadoop.mapred.InputSplit
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)

AFAICT, we are using the new APIs, but this seems to think it should be the old 
APIs. Note, this is an intermittent issue.  Sometimes it goes through just 
fine.  Locally, it passes for me.

Note, this could also be related to the Parallel tests stuff.

-Grant

On Jun 24, 2013, at 7:06 PM, Apache Jenkins Server jenk...@builds.apache.org 
wrote:

 Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.611 sec <<< FAILURE!
 testSequential(org.apache.mahout.text.SequenceFilesFromMailArchivesTest)  Time elapsed: 1.268 sec  <<< FAILURE!
 org.junit.ComparisonFailure: expected:<TEST/subdir/[mail-messages].gz/u...@example.com> but was:<TEST/subdir/[subsubdir/mail-messages-2].gz/u...@example.com>
   at org.junit.Assert.assertEquals(Assert.java:115)
   at org.junit.Assert.assertEquals(Assert.java:144)
   at org.apache.mahout.text.SequenceFilesFromMailArchivesTest.testSequential(SequenceFilesFromMailArchivesTest.java:108)


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: Build failed in Jenkins: mahout-nightly » Mahout Integration #1272

2013-06-24 Thread Grant Ingersoll
Never mind the noise here, I misread this!

Still, we have some error going on w/ random failures.

On Jun 24, 2013, at 8:33 PM, Grant Ingersoll gsing...@apache.org wrote:

 Can someone w/ more Hadoop experience look at this?  We are getting:
 
 java.lang.ClassCastException: org.apache.mahout.text.LuceneSegmentInputSplit cannot be cast to org.apache.hadoop.mapred.InputSplit
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
 
 AFAICT, we are using the new APIs, but this seems to think it should be the 
 old APIs. Note, this is an intermittent issue.  Sometimes it goes through 
 just fine.  Locally, it passes for me.
 
 Note, this could also be related to the Parallel tests stuff.
 
 -Grant
 
 On Jun 24, 2013, at 7:06 PM, Apache Jenkins Server 
 jenk...@builds.apache.org wrote:
 
 Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.611 sec <<< FAILURE!
 testSequential(org.apache.mahout.text.SequenceFilesFromMailArchivesTest)  Time elapsed: 1.268 sec  <<< FAILURE!
 org.junit.ComparisonFailure: expected:<TEST/subdir/[mail-messages].gz/u...@example.com> but was:<TEST/subdir/[subsubdir/mail-messages-2].gz/u...@example.com>
  at org.junit.Assert.assertEquals(Assert.java:115)
  at org.junit.Assert.assertEquals(Assert.java:144)
  at org.apache.mahout.text.SequenceFilesFromMailArchivesTest.testSequential(SequenceFilesFromMailArchivesTest.java:108)
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: 0.8 progress

2013-06-23 Thread Grant Ingersoll
I see 1 issue left: MAHOUT-1214.  It is assigned to Robin.  Any chance we can 
finish this up this week?

-Grant

On Jun 23, 2013, at 9:26 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Finally got to finishing up M-833, the changes can be reviewed at 
 https://reviews.apache.org/r/11774/diff/3/.
 
 
 
 
 
 
 From: Grant Ingersoll gsing...@apache.org
 To: dev@mahout.apache.org 
 Sent: Tuesday, June 11, 2013 10:09 AM
 Subject: Re: 0.8 progress
 
 
 I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by Thursday, I 
 can roll an RC on Thursday.
 
 -Grant
 
 On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote:
 
 Down to 4 issues!  I would say what they are, but JIRA is flaking out again.
 
 My instinct is that 1030 and 1233 can be pushed.  Suneel has been working 
 hard to get M-833 in.  Not sure on M-1214, Robin?
 
 -G
 
 On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 
 On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 M-1067 -- Dmitriy  --  This is an enhancement, should we push?
 
 Looks like this was committed already.
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: 0.8 progress

2013-06-15 Thread Grant Ingersoll
How's progress?

On Jun 12, 2013, at 8:46 PM, Grant Ingersoll gsing...@apache.org wrote:

 Fine by me.
 
 On Jun 12, 2013, at 6:12 PM, Robin Anil robin.a...@gmail.com wrote:
 
 +1 for monday. I would like this time to test MIA clustering code for the
 new version.
 
 Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
 
 
 On Wed, Jun 12, 2013 at 4:13 PM, Suneel Marthi 
 suneel_mar...@yahoo.com wrote:
 
 I am in the same boat as Dan in finishing up M-833, just not finding the
 time. I should have time on the weekend to wrap this up.
 Grant,  could we have the release on Monday?
 
 
 
 
 
 From: Dan Filimon dangeorge.fili...@gmail.com
 To: Mahout-Dev dev@mahout.apache.org
 Sent: Wednesday, June 12, 2013 5:09 PM
 Subject: Re: 0.8 progress
 
 
 It turns out that my initial estimate of the time it takes to finish these
 issues was overly optimistic.
 I'm squashed between work and writing my thesis and unforeseen merging
 issues.
 
 So, I hate to say this, but could we please postpone this release till
 Monday?
 
 
 On Wed, Jun 12, 2013 at 1:11 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Sounds good.
 
 On Jun 11, 2013, at 4:36 PM, Dan Filimon dangeorge.fili...@gmail.com
 wrote:
 
 Sorry to rain on everyone's party, but I opened a few more issues I
 need
 to
 take of before 0.8 final that I had forgotten about.
 M-1253 to M-1256.
 
 I have code for all of these (that I tested, incidentally, that's the
 code
 I used for the experiments in the talk :), just need to merge it in
 and I
 wanted to have issues to mark as done to keep track of things.
 
 Should not take long and I should be done by Thursday.
 Also, would anyone like to review the code on ReviewBoard? :)
 
 
 On Tue, Jun 11, 2013 at 5:09 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by
 Thursday, I can roll an RC on Thursday.
 
 -Grant
 
 On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Down to 4 issues!  I would say what they are, but JIRA is flaking out
 again.
 
 My instinct is that 1030 and 1233 can be pushed.  Suneel has been
 working hard to get M-833 in.  Not sure on M-1214, Robin?
 
 -G
 
 On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 
 On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 M-1067 -- Dmitriy  --  This is an enhancement, should we push?
 
 Looks like this was committed already.
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-14 Thread Grant Ingersoll
It seems that 6 pm ET is the consensus time for the majority of people, 
although my having screwed up the poll didn't help.

Bi-weekly is the other consensus.  It also looks like Tuesday or Thursday are 
the preferred dates.

I can't make next week, so I'm going to propose we kick off on Tuesday, June 25 
at 6 pm.  That will give us time to dry-run the Google Hangouts, etc.

Again, just to be clear, the goal here is to work on the development of Mahout, 
not to answer questions about how to run Mahout (we could do that separately if 
there is a desire.)

I'll send out a reminder as we get closer.

-Grant


On Jun 12, 2013, at 3:04 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 I am from Northern Virginia, how many of us here are from the Washington DC 
 Metro area?
 
 
 
 
 
 From: Jake Mannix jake.man...@gmail.com
 To: dev@mahout.apache.org dev@mahout.apache.org 
 Sent: Wednesday, June 12, 2013 1:56 PM
 Subject: Re: (Bi-)Weekly/Monthly Dev Sessions
 
 
 Wow, a lot of Seattleites, I should organize a Mahout MeetUp / Hackathon
 when I get back from Europe at the end of the summer!
 
 
 On Wed, Jun 12, 2013 at 10:44 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Bi-weekly is good for me; I'm in Seattle and just filled out the poll.
 
 Great idea!
 
 
 On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com
 wrote:
 
 +1, am in Seattle as well and would love to attend and be involved.
 
 Sent from my iPhone
 
 On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com
 wrote:
 
 Good idea on recurring meetings. I'm very interested in participating.
 Biweekly works for me. I'm in Seattle (pacific) timezone - GMT-8.
 
 An agenda for the meetings ahead of time will help us get the most of
 our
 time at the meetings.
 
 Thanks.
 On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 
 On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu wrote:
 
 Angel and Suneel, you may want to re-fill out the new doodle.
 
 FYI, this week won't be representative of my schedule; I'm in the
 last
 few weeks of a job at ORNL where I travel every weekend. Normally I'll
 have
 more flexibility than just 6pm on weeknights.
 
 Yeah, Doodle makes you pick dates, but I just want it to be
 representative of a week-long period of time and not tied to a specific
 set of dates.  So, just put in what your ideal times are in general and
 ignore the fact that it is set to next week.
 
 
 On 6/12/13 8:26 AM, Grant Ingersoll wrote:
 On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu
 wrote:
 
 +1, awesome idea
 
 One question: the poll, while set to GMT -5, does say it's in
 Central
 Time. Is this a daylight savings thing?
 I turned on Time Zone support, so not sure how it will look to
 others,
 but it sounds like it adjusts based on your location...  I see: 8 am,
 10,
 1, so on.
 
 I also realize, that I messed it up.  I meant 9 pm, not 9 am.
 
 Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 
 
 
 
 
 -- 
 
   -jake


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682108#comment-13682108
 ] 

Grant Ingersoll commented on MAHOUT-1214:
-

bq. But @Grant suggest we supply the patch of v0.7 first.

Yes, I was working under the assumption that an old patch is better than no 
patch.  A patch against HEAD is even better.  I think we have a few more days, 
so against HEAD would be great.

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
Assignee: Robin Anil
  Labels: clustering, improvement
 Fix For: 0.8

 Attachments: MAHOUT-1214.patch, matrix_1, matrix_2


 The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
 NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for a very obvious trivial dataset. We have 
 implemented a solution to resolve these two issues and hope to contribute it 
 back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 eigenvectors, which is necessary to obtain the correct clustering results for 
 the case of K>1. We have an idea and implementation to select based on 
 cosAngle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal and 
 sometimes a bad initialization will generate a wrong clustering result. In this 
 case, the selected K eigenvectors actually provide a better way to initialize 
 cluster centroids because each selected eigenvector is a relaxed indicator of 
 the memberships of one cluster. For every selected eigenvector, we use the 
 data point whose eigen component achieves the maximum absolute value. 
 We have already verified our improvement on a synthetic dataset and it shows 
 that the improved version gets the optimal clustering result while the current 
 0.7 version obtains the wrong result.
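
As a rough illustration of Issue 2 above (this is not the attached patch), the
seeding step could look like the sketch below. It assumes each selected
eigenvector is held as a double[] with one component per data point; the class
and method names are made up for this example. The cosine helper shows the kind
of pairwise check Issue 1 refers to, where a value near 0 means two
eigenvectors are close to orthogonal.

```java
/** Hypothetical helper, not part of Mahout: picks k-means seeds from eigenvectors. */
final class SpectralSeedSketch {
  private SpectralSeedSketch() {}

  /** Returns the index of the component with the largest absolute value. */
  static int argMaxAbs(double[] vector) {
    int best = 0;
    for (int i = 1; i < vector.length; i++) {
      if (Math.abs(vector[i]) > Math.abs(vector[best])) {
        best = i;
      }
    }
    return best;
  }

  /** For every selected eigenvector, returns the index of the data point to seed from. */
  static int[] seedIndices(double[][] eigenvectors) {
    int[] seeds = new int[eigenvectors.length];
    for (int k = 0; k < eigenvectors.length; k++) {
      seeds[k] = argMaxAbs(eigenvectors[k]);
    }
    return seeds;
  }

  /** Cosine of the angle between two vectors; NaN if either vector is all zeros. */
  static double cosine(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}
```

Note that two eigenvectors can point at the same data point; this sketch does
not deal with that case, and a real implementation would have to.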

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682325#comment-13682325
 ] 

Grant Ingersoll commented on MAHOUT-944:


[~smarthi], the error only seems to happen when running all the tests and it 
seems to be intermittent.  It almost looks like some type of classpath issue.

 LuceneIndexToSequenceFiles (lucene2seq) utility
 ---

 Key: MAHOUT-944
 URL: https://issues.apache.org/jira/browse/MAHOUT-944
 Project: Mahout
  Issue Type: New Feature
  Components: Integration
Affects Versions: 0.5
Reporter: Frank Scholten
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-944-minor.patch, MAHOUT-944.patch, 
 MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
 MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
 MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch


 Here is a lucene2seq tool I used in a project. It creates sequence files 
 based on the stored fields of a lucene index.
 The output from this tool can be then fed into seq2sparse and from there you 
 can do text clustering.
 Comes with Java bean configuration.
 Let me know what you think. Some CLI code can be added later on. I used this 
 for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
 overkill?
 See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
 review comments from Simon Willnauer (Thanks Simon!)
 or the attached patch.



Re: 0.8 progress

2013-06-12 Thread Grant Ingersoll
Sounds good.

On Jun 11, 2013, at 4:36 PM, Dan Filimon dangeorge.fili...@gmail.com wrote:

 Sorry to rain on everyone's party, but I opened a few more issues I need to
 take care of before 0.8 final that I had forgotten about.
 M-1253 to M-1256.
 
 I have code for all of these (that I tested, incidentally, that's the code
 I used for the experiments in the talk :), just need to merge it in and I
 wanted to have issues to mark as done to keep track of things.
 
 Should not take long and I should be done by Thursday.
 Also, would anyone like to review the code on ReviewBoard? :)
 
 
 On Tue, Jun 11, 2013 at 5:09 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by
 Thursday, I can roll an RC on Thursday.
 
 -Grant
 
 On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote:
 
 Down to 4 issues!  I would say what they are, but JIRA is flaking out
 again.
 
 My instinct is that 1030 and 1233 can be pushed.  Suneel has been
 working hard to get M-833 in.  Not sure on M-1214, Robin?
 
 -G
 
 On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 
 On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 M-1067 -- Dmitriy  --  This is an enhancement, should we push?
 
 Looks like this was committed already.
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







(Bi-)Weekly/Monthly Dev Sessions

2013-06-12 Thread Grant Ingersoll
Hi,

One of the things we kicked around at Buzzwords was having a 
weekly/bi-weekly/monthly dev session via Google hangout (Drill does this with 
good success, I believe).  Since we are so spread out, I thought I would throw 
out a Doodle (scheduling tool for those unfamiliar) to see what times work best 
for the majority of people interested in such a thing.  Anyone is free to 
participate, but this is not a Q and A session, but is instead focused on 
writing code, fixing bugs, triaging JIRA, releasing, etc.

If you are interested, please fill out http://doodle.com/gatxxkm7f25fq5y8  
(note, all times are Eastern Time Zone since I did the poll!)  I just grabbed a 
sampling of hours throughout the day.  I also picked 1 week as being 
representative of this being on a repeating schedule.  If none of the times 
work for you, but you are still interested, please respond here.  I would 
imagine we would meet for 1-2 hours.  

Also, please reply with the frequency at which you would like to meet:

[]  Weekly
[]  Bi-weekly (every 2 weeks)
[]  Monthly

My vote is every two weeks.

-Grant

Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-12 Thread Grant Ingersoll

On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote:

 +1, awesome idea
 
 One question: the poll, while set to GMT -5, does say it's in Central Time. 
 Is this a daylight savings thing?

I turned on Time Zone support, so not sure how it will look to others, but it 
sounds like it adjusts based on your location...  I see: 8 am, 10, 1, so on.

I also realize, that I messed it up.  I meant 9 pm, not 9 am.

Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv




[jira] [Commented] (MAHOUT-833) Make conversion to sequence files map-reduce

2013-06-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681206#comment-13681206
 ] 

Grant Ingersoll commented on MAHOUT-833:


The patch seems to be missing the WholeFileRecordReader.

 Make conversion to sequence files map-reduce
 

 Key: MAHOUT-833
 URL: https://issues.apache.org/jira/browse/MAHOUT-833
 Project: Mahout
  Issue Type: Improvement
  Components: Integration
Affects Versions: 0.7
Reporter: Grant Ingersoll
Assignee: Suneel Marthi
  Labels: MAHOUT_INTRO_CONTRIBUTE
 Fix For: 0.8

 Attachments: MAHOUT-833-final.patch, MAHOUT-833.patch, 
 MAHOUT-833.patch


 Given input that is on HDFS, the SequenceFilesFrom.java classes should be 
 able to do their work in parallel.



Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-12 Thread Grant Ingersoll

On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu wrote:

 Angel and Suneel, you may want to re-fill out the new doodle.
 
 FYI, this week won't be representative of my schedule; I'm in the last few 
 weeks of a job at ORNL where I travel every weekend. Normally I'll have more 
 flexibility than just 6pm on weeknights.

Yeah, Doodle makes you pick dates, but I just want it to be representative of a 
week-long period of time and not tied to a specific set of dates.  So, just put 
in what your ideal times are in general and ignore the fact that it is set to 
next week.

 
 On 6/12/13 8:26 AM, Grant Ingersoll wrote:
 On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote:
 
 +1, awesome idea
 
 One question: the poll, while set to GMT -5, does say it's in Central Time. 
 Is this a daylight savings thing?
 I turned on Time Zone support, so not sure how it will look to others, but 
 it sounds like it adjusts based on your location...  I see: 8 am, 10, 1, so 
 on.
 
 I also realize, that I messed it up.  I meant 9 pm, not 9 am.
 
 Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: In-Mapper combiner design pattern

2013-06-12 Thread Grant Ingersoll
Hi DB,

This all sounds rather interesting.  I see a number of places where we use 
combiners, so perhaps focus on those first?

Also, any thoughts on when the scalable SVM would be ready?  We are trying to 
get 1.0 out in the next few months and I personally think it would be good to 
have SVM in.

-Grant

On Jun 11, 2013, at 8:20 PM, DB Tsai dbt...@dbtsai.com wrote:

 Hi,
 
 Recently we started to use the in-mapper combiner design patterns in
 our hadoop based algorithms at Alpine Data Labs; those algorithms
 include variable selection using info gain, decision tree, naive bayes
 model and SVM, and we found that we can have 20~40% performance
 speedup without doing too much work.
 
 The whole idea is really simple: just use an in-mapper LRU cache to
 combine the results first instead of using the combiner directly. If the
 cache is full, just emit the result to the combiner or reducer. The details
 are discussed in Data-Intensive Text Processing with MapReduce
 (http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf)
 by Jimmy Lin and Chris Dyer at University of Maryland, College Park.
 
 We would like to contribute the api to mahout, and work more closely with
 the open source community. I'm now working on random forest using
 information gain, and we plan to contribute it to the mahout
 community. We also have a scalable kernel SVM implementation which
 we intend to contribute to mahout as well. We just presented a talk
 about our SVM at the SF machine learning meetup with great feedback, see
 
 http://www.meetup.com/sfmachinelearning/events/116497192/?_af_eid=116497192a=uc1_te_af=event
 
 The api is pretty simple, just change context.write to combiner.write,
 and remember to flush the cache in the clean up method.
 
 This is the example of implementing hadoop classical word count using
 in-mapper combiner,
 https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerExampleTest.java
 
 , and all we need to do is just change from context.write to
 combiner.write. The test code for this example is in
 https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
 
 This is the actual implementation of the in-mapper combiner using an LRU cache,
 https://github.com/dbtsai/mahout/blob/trunk/core/src/main/java/org/apache/mahout/common/mapreduce/InMapperCombiner.java
 
 and this implementation is well tested.
 https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
 
 I'm wondering what is the best candidate in mahout to use this kind of
 in-mapper combiner now to demonstrate this idea works, and I'll focus
 on that particular use case, and do benchmark.
 
 Thanks.
 
 Sincerely,
 
 DB Tsai
 ---
 Web: http://www.dbtsai.com
 Phone : +1-650-383-8392


Grant Ingersoll | @gsingers
http://www.lucidworks.com
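
For readers who have not seen the pattern, here is a minimal sketch of the
in-mapper combining idea described in DB's message, written in plain Java
without any Hadoop dependency. It is not the InMapperCombiner class from the
linked repository; the class name, the summing behaviour and the "downstream"
callback (standing in for context.write) are assumptions made for this
example. Values for the same key are summed in an access-ordered LRU cache,
the least recently used entry is emitted when the cache overflows, and flush()
plays the role of the call in the mapper's cleanup method.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical sketch: sums values per key in a bounded LRU cache before emitting. */
final class LruSumCombiner<K> {
  private final int capacity;
  private final BiConsumer<K, Long> downstream; // stand-in for context.write
  private final LinkedHashMap<K, Long> cache;

  LruSumCombiner(int capacity, BiConsumer<K, Long> downstream) {
    if (capacity < 1) {
      throw new IllegalArgumentException("capacity must be at least 1");
    }
    this.capacity = capacity;
    this.downstream = downstream;
    // accessOrder = true keeps the least recently used key first in iteration order
    this.cache = new LinkedHashMap<>(16, 0.75f, true);
  }

  /** Combines the value into the cache; emits the LRU entry if the cache overflows. */
  void write(K key, long value) {
    cache.merge(key, value, Long::sum);
    if (cache.size() > capacity) {
      Iterator<Map.Entry<K, Long>> it = cache.entrySet().iterator();
      Map.Entry<K, Long> eldest = it.next();
      K evictedKey = eldest.getKey();
      Long evictedValue = eldest.getValue();
      it.remove();
      downstream.accept(evictedKey, evictedValue);
    }
  }

  /** Emits everything still cached; call once when the mapper is done. */
  void flush() {
    cache.forEach(downstream);
    cache.clear();
  }
}
```

Because a key can be evicted and then seen again, the same key may be emitted
more than once, so a reducer (or regular combiner) that sums the partial
results is still needed downstream.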







[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681744#comment-13681744
 ] 

Grant Ingersoll commented on MAHOUT-944:


Suneel, weird.  I didn't see that before.  We are using the new APIs, AFAICT, 
so not sure what is going on.  So tired of the stupidity of the dual Map/Reduce 
APIs in Hadoop.

 LuceneIndexToSequenceFiles (lucene2seq) utility
 ---

 Key: MAHOUT-944
 URL: https://issues.apache.org/jira/browse/MAHOUT-944
 Project: Mahout
  Issue Type: New Feature
  Components: Integration
Affects Versions: 0.5
Reporter: Frank Scholten
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-944-minor.patch, MAHOUT-944.patch, 
 MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
 MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
 MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch


 Here is a lucene2seq tool I used in a project. It creates sequence files 
 based on the stored fields of a lucene index.
 The output from this tool can be then fed into seq2sparse and from there you 
 can do text clustering.
 Comes with Java bean configuration.
 Let me know what you think. Some CLI code can be added later on. I used this 
 for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
 overkill?
 See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
 review comments from Simon Willnauer (Thanks Simon!)
 or the attached patch.



Re: 0.8 progress

2013-06-12 Thread Grant Ingersoll
Fine by me.

On Jun 12, 2013, at 6:12 PM, Robin Anil robin.a...@gmail.com wrote:

 +1 for monday. I would like this time to test MIA clustering code for the
 new version.
 
 Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
 
 
 On Wed, Jun 12, 2013 at 4:13 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 I am in the same boat as Dan in finishing up M-833, just not finding the
 time. I should have time on the weekend to wrap this up.
 Grant,  could we have the release on Monday?
 
 
 
 
 
 From: Dan Filimon dangeorge.fili...@gmail.com
 To: Mahout-Dev dev@mahout.apache.org
 Sent: Wednesday, June 12, 2013 5:09 PM
 Subject: Re: 0.8 progress
 
 
 It turns out that my initial estimate of the time it takes to finish these
 issues was overly optimistic.
 I'm squashed between work and writing my thesis and unforeseen merging
 issues.
 
 So, I hate to say this, but could we please postpone this release till
 Monday?
 
 
 On Wed, Jun 12, 2013 at 1:11 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Sounds good.
 
 On Jun 11, 2013, at 4:36 PM, Dan Filimon dangeorge.fili...@gmail.com
 wrote:
 
 Sorry to rain on everyone's party, but I opened a few more issues I need
 to take care of before 0.8 final that I had forgotten about.
 M-1253 to M-1256.
 
 I have code for all of these (that I tested, incidentally, that's the code
 I used for the experiments in the talk :), just need to merge it in and I
 wanted to have issues to mark as done to keep track of things.
 
 Should not take long and I should be done by Thursday.
 Also, would anyone like to review the code on ReviewBoard? :)
 
 
 On Tue, Jun 11, 2013 at 5:09 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by
 Thursday, I can roll an RC on Thursday.
 
 -Grant
 
 On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Down to 4 issues!  I would say what they are, but JIRA is flaking out
 again.
 
 My instinct is that 1030 and 1233 can be pushed.  Suneel has been
 working hard to get M-833 in.  Not sure on M-1214, Robin?
 
 -G
 
 On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 
 On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 M-1067 -- Dmitriy  --  This is an enhancement, should we push?
 
 Looks like this was committed already.
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Updated] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

2013-06-11 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1030:


Fix Version/s: (was: 0.8)
   1.0

I'm going to push this.  I know that for 0.9 we are looking at reworking the 
way we handle vectors and their associated properties (i.e. get rid of 
NamedVector, etc.)

 Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
 WeightedVectorWritable
 

 Key: MAHOUT-1030
 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
 Project: Mahout
  Issue Type: Bug
  Components: Clustering, Integration
Affects Versions: 0.7
Reporter: Jeff Eastman
Assignee: Suneel Marthi
 Fix For: 1.0

 Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch


 Looks like this won't make it into this build. Pretty widespread impact on 
 code and tests and I don't know which properties were implemented in the old 
 version. I will create a JIRA and post my interim results.
 On 6/8/12 12:21 PM, Jeff Eastman wrote:
  That's a reversion that evidently got in when the new 
  ClusterClassificationDriver was introduced. It should be a pretty easy fix 
  and I will see if I can make the change before Paritosh cuts the release 
  bits tonight.
 
  On 6/7/12 1:00 PM, Pat Ferrel wrote:
  It appears that in kmeans the clusteredPoints are now written as 
  WeightedVectorWritable where in mahout 0.6 they were 
  WeightedPropertyVectorWritable? This means that the distance from the 
  centroid is no longer stored here? Why? I hope I'm wrong because that is 
  not a welcome change. How is one to order clustered docs by distance from 
  cluster centroid?
 
  I'm sure I could calculate the distance but that would mean looking up the 
  centroid for the cluster id given in the above WeightedVectorWritable, 
  which means iterating through all the clusters for each clustered doc. In 
  my case the number of clusters could be fairly large.
 
  Am I missing something?
 
 
 



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680392#comment-13680392
 ] 

Grant Ingersoll commented on MAHOUT-1214:
-

Any update on this for applying against trunk/0.8?

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
Assignee: Robin Anil
  Labels: clustering, improvement
 Fix For: 0.8

 Attachments: matrix_1, matrix_2, SpectralKMeans.patch


 The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
 NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for a very obvious trivial dataset. We have 
 implemented a solution to resolve these two issues and hope to contribute it 
 back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 eigenvectors, which is necessary to obtain the correct clustering results for 
 the case of K>1. We have an idea and implementation to select based on 
 cosAngle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal and 
 sometimes a bad initialization will generate a wrong clustering result. In this 
 case, the selected K eigenvectors actually provide a better way to initialize 
 cluster centroids because each selected eigenvector is a relaxed indicator of 
 the memberships of one cluster. For every selected eigenvector, we use the 
 data point whose eigen component achieves the maximum absolute value. 
 We have already verified our improvement on a synthetic dataset and it shows 
 that the improved version gets the optimal clustering result while the current 
 0.7 version obtains the wrong result.



Re: 0.8 progress

2013-06-11 Thread Grant Ingersoll
I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by Thursday, I 
can roll an RC on Thursday.

-Grant

On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote:

 Down to 4 issues!  I would say what they are, but JIRA is flaking out again.
 
 My instinct is that 1030 and 1233 can be pushed.  Suneel has been working 
 hard to get M-833 in.  Not sure on M-1214, Robin?
 
 -G
 
 On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 
 On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 M-1067 -- Dmitriy  --  This is an enhancement, should we push?
 
 Looks like this was committed already.
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Updated] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

2013-06-11 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1030:


Fix Version/s: 0.9

 Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
 WeightedVectorWritable
 

 Key: MAHOUT-1030
 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
 Project: Mahout
  Issue Type: Bug
  Components: Clustering, Integration
Affects Versions: 0.7
Reporter: Jeff Eastman
Assignee: Suneel Marthi
 Fix For: 1.0, 0.9

 Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch


 Looks like this won't make it into this build. Pretty widespread impact on 
 code and tests and I don't know which properties were implemented in the old 
 version. I will create a JIRA and post my interim results.
 On 6/8/12 12:21 PM, Jeff Eastman wrote:
  That's a reversion that evidently got in when the new 
  ClusterClassificationDriver was introduced. It should be a pretty easy fix 
  and I will see if I can make the change before Paritosh cuts the release 
  bits tonight.
 
  On 6/7/12 1:00 PM, Pat Ferrel wrote:
  It appears that in kmeans the clusteredPoints are now written as 
  WeightedVectorWritable where in mahout 0.6 they were 
  WeightedPropertyVectorWritable? This means that the distance from the 
  centroid is no longer stored here? Why? I hope I'm wrong because that is 
  not a welcome change. How is one to order clustered docs by distance from 
  cluster centroid?
 
  I'm sure I could calculate the distance but that would mean looking up the 
  centroid for the cluster id given in the above WeightedVectorWritable, 
  which means iterating through all the clusters for each clustered doc. In 
  my case the number of clusters could be fairly large.
 
  Am I missing something?
 
 
 



[jira] [Resolved] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos

2013-06-11 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1233.
-

Resolution: Incomplete

Please reopen if you have a repeatable test case, as I am not sure there is an 
issue here.

 Problem in processing datasets as a single chunk vs many chunks in HADOOP 
 mode in mostly all the clustering algos
 -

 Key: MAHOUT-1233
 URL: https://issues.apache.org/jira/browse/MAHOUT-1233
 Project: Mahout
  Issue Type: Question
  Components: Clustering
Affects Versions: 0.7, 0.8
Reporter: yannis ats
Assignee: yannis ats
Priority: Minor
 Fix For: 0.8


 I am trying to process a dataset in two ways.
 First I give it as a single chunk (the whole dataset), and second as many 
 smaller chunks in order to increase the throughput of my machine.
 When I perform the single-chunk computation the results are fine: if I have 
 1000 vectors in the input, I get 1000 vector ids with their cluster ids in the 
 output (I have tried canopy, kmeans and fuzzy kmeans).
 However, when I split the dataset to speed up the computations, strange 
 phenomena occur.
 For instance, when the same 1000-vector dataset is split into, for example, 
 10 files, I obtain more vector ids in the output (e.g. 1100 vector ids with 
 their corresponding cluster ids).
 The question is, am I doing something wrong in the process?
 Is there a problem in clusterdump and seqdumper when the input is in many 
 files?
 While mahout is performing the computations, the screen output says that it 
 processed the correct number of vectors.
 Am I missing something?
 I use as input the transformed to mvc weka vectors.
 I have tried this in v0.7 and the v0.8 snapshot.
 Thank you in advance for your time.



Re: Random Errors

2013-06-10 Thread Grant Ingersoll
That was the whole stack trace, unfortunately.

On Jun 10, 2013, at 2:35 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

 Grant, top of the stack trace is not sufficient to tell what was the
 offending thread. Copy-paste the entire stack, including nested
 exceptions. The console will also contain a full stack trace
 information at the moment the test framework detected a thread leak.
 It should be easy to tell what isn't cleaned up properly.
 
 Dawid




[jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix

2013-06-10 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679817#comment-13679817
 ] 

Grant Ingersoll commented on MAHOUT-1147:
-

Jake, are you up to date?  I fixed a bunch of things related to 
cluster-reuters.  Also, do you have HADOOP_HOME set?  Or MAHOUT_LOCAL?

 CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random 
 matrix
 ---

 Key: MAHOUT-1147
 URL: https://issues.apache.org/jira/browse/MAHOUT-1147
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.7
 Environment: Eclipse IDE
 Java code base
 CVB0Driver Class
 setModelPaths(Job job, Path modelPath) - method
Reporter: Jack Pay
Assignee: Jake Mannix
  Labels: bug, cvb, fix, suggestion
 Fix For: 0.8

 Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Problem:
 When training doc/topic model no paths for the term/topic model found 
 (outputs null).
 These paths are set using setModelPaths in CVB0Driver.
 Reason for Problem:
 Variety of Job instances call this method. 
 The Job is passed to the method instead of the Configuration object given to 
 the Job.
 The configuration is retrieved from the Job instance itself.
 I believe that this Configuration instance is a clone of the original.
 This is a problem as the variable MODEL_PATHS is set on the clone which is 
 then discarded when the given Job is complete.
 The original Configuration has no MODEL_PATHS String set and therefore 
 returns null.
 The code stipulates that if it cannot find a model to use a new random 
 matrix. This happens every time as MODEL_PATHS is not set for the 
 Configuration instance used.
 Solution:
 Do not pass the Job to the setModelPaths method, but pass the Configuration 
 instance passed into the method which created the Job.
 i.e.
 change from:
 setModelPaths(Job job, Path modelPath)
 to:
 setModelPaths(Configuration conf, Path modelPath)
 And change all calling methods accordingly (obviously).
 So far what little testing I have done appears to solve this problem.
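
The failure mode in this description can be reproduced with a small stand-in
model. The sketch below does not use Hadoop; SketchConf and SketchJob are
hypothetical classes that only mimic the one behaviour the description relies
on, namely that a job keeps its own copy of the configuration it was created
with. The property key is also made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a configuration object; not Hadoop's Configuration. */
final class SketchConf {
  private final Map<String, String> props = new HashMap<>();

  SketchConf() {}

  /** Copy constructor: the new object shares nothing with the original. */
  SketchConf(SketchConf other) {
    props.putAll(other.props);
  }

  void set(String key, String value) {
    props.put(key, value);
  }

  String get(String key) {
    return props.get(key);
  }
}

/** Hypothetical stand-in for a job that copies its configuration on creation. */
final class SketchJob {
  private final SketchConf conf;

  SketchJob(SketchConf conf) {
    this.conf = new SketchConf(conf); // assumption: the job works on a private copy
  }

  SketchConf getConfiguration() {
    return conf;
  }
}

final class ModelPathsSketch {
  static final String MODEL_PATHS = "sketch.model.paths"; // made-up key

  private ModelPathsSketch() {}

  /** Old shape: the value lands on the job's copy and is lost with the job. */
  static void setModelPaths(SketchJob job, String modelPath) {
    job.getConfiguration().set(MODEL_PATHS, modelPath);
  }

  /** Proposed shape: the value lands on the driver's configuration and is seen by later jobs. */
  static void setModelPaths(SketchConf conf, String modelPath) {
    conf.set(MODEL_PATHS, modelPath);
  }
}
```

With the first overload the driver's configuration never sees the path, so the
next job created from it reads null; with the second overload every job
created afterwards inherits the value.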
  



[jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix

2013-06-10 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679855#comment-13679855
 ] 

Grant Ingersoll commented on MAHOUT-1147:
-

Hmm, I tested k-means cluster-reuters.sh last night on Hadoop single node and 
it worked fine.  I added a step to copy the reuters-out up to HDFS.  Let me 
make sure I pushed (see MAHOUT-1247)

 CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random 
 matrix
 ---

 Key: MAHOUT-1147
 URL: https://issues.apache.org/jira/browse/MAHOUT-1147
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.7
 Environment: Eclipse IDE
 Java code base
 CVB0Driver Class
 setModelPaths(Job job, Path modelPath) - method
Reporter: Jack Pay
Assignee: Jake Mannix
  Labels: bug, cvb, fix, suggestion
 Fix For: 0.8

 Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Problem:
 When training doc/topic model no paths for the term/topic model found 
 (outputs null).
 These paths are set using setModelPaths in CVB0Driver.
 Reason for Problem:
 Variety of Job instances call this method. 
 The Job is passed to the method instead of the Configuration object given to 
 the Job.
 The configuration is retrieved from the Job instance itself.
 I believe that this Configuration instance is a clone of the original.
 This is a problem as the variable MODEL_PATHS is set on the clone which is 
 then discarded when the given Job is complete.
 The original Configuration has no MODEL_PATHS String set and therefore 
 returns null.
 The code stipulates that if it cannot find a model to use a new random 
 matrix. This happens every time as MODEL_PATHS is not set for the 
 Configuration instance used.
 Solution:
 Do not pass the Job to the setModelPaths method, but pass the Configuration 
 instance passed into the method which created the Job.
 i.e.
 change from:
 setModelPaths(Job job, Path modelPath)
 to:
 setModelPaths(Configuration conf, Path modelPath)
 And change all calling methods accordingly (obviously).
 So far what little testing I have done appears to solve this problem.
  



[jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix

2013-06-10 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679858#comment-13679858
 ] 

Grant Ingersoll commented on MAHOUT-1147:
-

Do you see:
{code}
echo "Extracting Reuters"
$MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters \
  ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-out
# Only copy to HDFS when running against Hadoop (not in MAHOUT_LOCAL mode)
if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ] ; then
  echo "Copying Reuters data to Hadoop"
  # The removals fail if the directories do not exist yet, so
  # exit-on-error is switched off around them
  set +e
  $HADOOP dfs -rmr ${WORK_DIR}/reuters-sgm
  $HADOOP dfs -rmr ${WORK_DIR}/reuters-out
  set -e
  $HADOOP dfs -put ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-sgm
  $HADOOP dfs -put ${WORK_DIR}/reuters-out ${WORK_DIR}/reuters-out
fi
{code}

Also, I'm on #mahout on IRC if that helps us resolve this faster.

 CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random 
 matrix
 ---

 Key: MAHOUT-1147
 URL: https://issues.apache.org/jira/browse/MAHOUT-1147
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.7
 Environment: Eclipse IDE
 Java code base
 CVB0Driver Class
 setModelPaths(Job job, Path modelPath) - method
Reporter: Jack Pay
Assignee: Jake Mannix
  Labels: bug, cvb, fix, suggestion
 Fix For: 0.8

 Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Problem:
 When training the doc/topic model, no paths for the term/topic model are 
 found (the lookup returns null).
 These paths are set using setModelPaths in CVB0Driver.
 Reason for Problem:
 A variety of Job instances call this method.
 The Job is passed to the method instead of the Configuration object given to 
 the Job.
 The configuration is retrieved from the Job instance itself.
 I believe that this Configuration instance is a clone of the original.
 This is a problem, as the variable MODEL_PATHS is set on the clone, which is 
 then discarded when the given Job is complete.
 The original Configuration has no MODEL_PATHS String set and therefore 
 returns null.
 The code falls back to a new random matrix if it cannot find a model. This 
 happens every time, as MODEL_PATHS is not set for the Configuration instance 
 used.
 Solution:
 Do not pass the Job to the setModelPaths method, but pass the Configuration 
 instance that was passed into the method which created the Job.
 i.e.
 change from:
 setModelPaths(Job job, Path modelPath)
 to:
 setModelPaths(Configuration conf, Path modelPath)
 And change all calling methods accordingly.
 So far, what little testing I have done suggests this solves the problem.
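
The copy semantics described in the report can be illustrated without Hadoop on the classpath. The sketch below is only an illustration under stated assumptions: FakeConfiguration and FakeJob are hypothetical stand-ins (they are not Hadoop classes), FakeJob copies the configuration it is given, which is the behaviour the reporter suspects, and the key string and model path are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Hadoop Configuration: a plain key/value map.
class FakeConfiguration {
    private final Map<String, String> values = new HashMap<>();

    FakeConfiguration() {}

    // Copy constructor: the copy shares no state with the original.
    FakeConfiguration(FakeConfiguration other) {
        values.putAll(other.values);
    }

    void set(String key, String value) { values.put(key, value); }

    String get(String key) { return values.get(key); }
}

// Hypothetical stand-in for a Job that clones the configuration it receives.
class FakeJob {
    private final FakeConfiguration conf;

    FakeJob(FakeConfiguration original) {
        this.conf = new FakeConfiguration(original);
    }

    FakeConfiguration getConfiguration() { return conf; }
}

class ConfCloneDemo {
    static final String MODEL_PATHS = "model.paths"; // illustrative key string

    // Pattern from the report: the value is set on the job's private copy.
    static String setOnJob(FakeConfiguration original) {
        FakeJob job = new FakeJob(original);
        job.getConfiguration().set(MODEL_PATHS, "/models/topic-model");
        return original.get(MODEL_PATHS); // the original never sees the value
    }

    // Proposed fix: set the value on the configuration before the job copies it.
    static String setOnConf(FakeConfiguration original) {
        original.set(MODEL_PATHS, "/models/topic-model");
        FakeJob job = new FakeJob(original);
        return job.getConfiguration().get(MODEL_PATHS);
    }

    public static void main(String[] args) {
        System.out.println("set on job:  " + setOnJob(new FakeConfiguration()));
        System.out.println("set on conf: " + setOnConf(new FakeConfiguration()));
    }
}
```

In this sketch the first call yields null and the second yields the path, which mirrors the symptom (no model found, random matrix used) and the effect of the proposed signature change.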
  



Welcome new committers Gokhan Capan and Stevo Slavic

2013-06-10 Thread Grant Ingersoll
Please join me in congratulating Mahout's newest committers, Gokhan Capan and 
Stevo Slavic, both of whom have been contributing to Mahout for some time now.

Gokhan, Stevo, new committer tradition is to give a brief background on 
yourself, so you have the floor!

Congrats,
Grant


[jira] [Created] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)
Grant Ingersoll created MAHOUT-1247:
---

 Summary: cluster-reuters doesn't work on Hadoop
 Key: MAHOUT-1247
 URL: https://issues.apache.org/jira/browse/MAHOUT-1247
 Project: Mahout
  Issue Type: Bug
Reporter: Grant Ingersoll
 Fix For: 0.8


At least two issues:

1. MAHOUT-992 messed up the Distributed Cache stuff somehow
2. The ExtractReuters data is not being moved to HDFS.



[jira] [Assigned] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-1247:
---

Assignee: Grant Ingersoll

 cluster-reuters doesn't work on Hadoop
 --

 Key: MAHOUT-1247
 URL: https://issues.apache.org/jira/browse/MAHOUT-1247
 Project: Mahout
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 0.8


 At least two issues:
 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
 2. The ExtractReuters data is not being moved to HDFS.



[jira] [Resolved] (MAHOUT-1126) Mac builds won't unjar

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1126.
-

Resolution: Fixed

I think the filter I put in place should (hopefully) fix this going forward.

 Mac builds won't unjar
 --

 Key: MAHOUT-1126
 URL: https://issues.apache.org/jira/browse/MAHOUT-1126
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.8
 Environment: Builds on the Mac
Reporter: Pat Ferrel
Assignee: Grant Ingersoll
  Labels: build
 Fix For: 0.8


 On the Mac you have to remove the licenses in the mahout jar or hadoop can't 
 unjar mahout. The Mac has a case insensitive file system and so can't tell 
 the difference between LICENSE and license. This was fixed at one point 
 https://issues.apache.org/jira/browse/MAHOUT-780
 zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
 META-INF/license/
 zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
 META-INF/LICENSE/
 Looks like as is mentioned in 
 https://issues.apache.org/jira/browse/MAHOUT-780 
 mv target/maven-shared-archive-resources/META-INF/LICENSE 
 target/maven-shared-archive-resources/META-INF/LICENSES
 works too.
 Can this get a permanent fix?



[jira] [Resolved] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1103.
-

Resolution: Fixed

 clusterpp is not writing directories for all clusters
 -

 Key: MAHOUT-1103
 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
Assignee: Grant Ingersoll
  Labels: clusterpp
 Fix For: 0.8

 Attachments: MAHOUT-1103.patch, MAHOUT-1103.patch, MAHOUT-1103.patch


 After running kmeans clustering on a set of ~3M points, clusterpp fails to 
 populate directories for some clusters, no matter what k is.
 I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
 Even with k=2 only one cluster directory was created. For each reducer that 
 fails to produce directories there is an empty part-r-* file in the output 
 directory.
 Here is my command sequence for the k=2 run:
 {noformat}
 bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
 bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
 bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom
 {noformat}
 The output of clusterdump shows two clusters: VL-3742464 and VL-3742466, 
 containing 2585843 and 1156624 points respectively.
 Discussion on the user mailing list suggested that this might be caused by 
 the default hadoop hash partitioner. The hashes of these two clusters aren't 
 identical, but they are close. Putting both cluster names into a Text and 
 calling hashCode() gives:
 VL-3742464 -> -685560454
 VL-3742466 -> -685560452
 Finally, when running with -xm sequential, everything performs as expected.
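
The partitioner hypothesis above can be checked with plain Java. The sketch below is an assumption-laden illustration, not Mahout or Hadoop code: hashBytes re-implements the byte hash that a Text key is understood to use (start at 1, then hash = 31 * hash + b per byte), and partition applies the usual default rule of clearing the sign bit and taking the modulus by the number of reducers. PartitionDemo and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class PartitionDemo {
    // Re-implementation (assumption) of the byte hash used for Text keys:
    // start at 1 and fold in each byte as hash = 31 * hash + b.
    static int hashBytes(String key) {
        int hash = 1;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + b;
        }
        return hash;
    }

    // Default hash partitioning rule: clear the sign bit, then take the modulus.
    static int partition(String key, int numReducers) {
        return (hashBytes(key) & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] keys = {"VL-3742464", "VL-3742466"};
        for (String key : keys) {
            System.out.println(key + " hash=" + hashBytes(key)
                + " partition(2 reducers)=" + partition(key, 2));
        }
    }
}
```

Because the two hashes differ by exactly 2, they have the same parity, so with two reducers both keys land in the same partition and the other reducer receives nothing. That is consistent with the empty part-r-* files described in the report.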



Re: Random Errors

2013-06-09 Thread Grant Ingersoll
I get a failure on the one below when running in parallel, but not standalone: 

Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 10.358 sec <<< FAILURE!
testRun(org.apache.mahout.text.SequenceFilesFromLuceneStorageMRJobTest)  Time elapsed: 10.358 sec  <<< FAILURE!
java.lang.AssertionError: expected:<2002> but was:<0>
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:743)
    at org.junit.Assert.assertEquals(Assert.java:118)
    at org.junit.Assert.assertEquals(Assert.java:555)
    at org.junit.Assert.assertEquals(Assert.java:542)
    at org.apache.mahout.text.SequenceFilesFromLuceneStorageMRJobTest.testRun(SequenceFilesFromLuceneStorageMRJobTest.java:73)


Interesting thing about this one is the Test class has only a single test and 
it has no randomization.

FWIW, it's also becoming increasingly clear to me that we need some notion of 
real integration tests that we can run against a Hadoop cluster (or at least a 
virtual Hadoop cluster).

-Grant

On Jun 8, 2013, at 9:38 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

 number generators. Where a test depends on a particular sequence, and
 somewhere an RNG doesn't use the RandomUtils trick, it may have a
 different state if other tests ran before.
 
 I have a different solution for this in randomizedtesting framework (a
 Random instance cannot be shared from test to test, it will throw an
 exception if you do share it). This doesn't solve all the possible
 problems but proved quite effective at catching test dependencies.
 
 The surefire parameter just controls what order the *classes* run in AFAICT:
 http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#runOrder
 
 Yeah, I was on the train when I wrote that e-mail. The trick I
 remembered is in fact inside JUnit 4.11 and onwards --
 https://github.com/junit-team/junit/blob/master/doc/ReleaseNotes4.11.md#test-execution-order
 
 D.
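
The order dependence described in the quoted message can be reproduced with java.util.Random alone. This is a minimal sketch, assuming a statically shared generator; SharedRngDemo and its methods are hypothetical and are not Mahout's RandomUtils or the randomizedtesting framework.

```java
import java.util.Random;

class SharedRngDemo {
    // A shared, statically seeded generator, as a test suite might hold by accident.
    static Random shared = new Random(42L);

    static void reset() { shared = new Random(42L); }

    // "Test A" consumes one value from the shared generator.
    static int testA() { return shared.nextInt(); }

    // "Test B" expects to see the first value of the seeded sequence.
    static int testB() { return shared.nextInt(); }

    // Isolated variant: each test builds its own generator from a fixed seed.
    static int testBIsolated() { return new Random(42L).nextInt(); }

    public static void main(String[] args) {
        reset();
        int alone = testB();   // B runs first
        reset();
        testA();
        int afterA = testB();  // B runs after A
        System.out.println("B alone == B after A: " + (alone == afterA));
        System.out.println("isolated B stable:    " + (testBIsolated() == testBIsolated()));
    }
}
```

With the shared generator, the value B sees depends on whether A ran first. The isolated variant returns the same value regardless of ordering, which is the property a per-test generator is meant to provide.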


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Assigned] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-1211:
---

Assignee: Grant Ingersoll  (was: Ted Dunning)

 Replace deprecated Closables.closeQuietly calls
 ---

 Key: MAHOUT-1211
 URL: https://issues.apache.org/jira/browse/MAHOUT-1211
 Project: Mahout
  Issue Type: Improvement
Reporter: Stevo Slavic
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1211.patch


 Deprecated Guava {{Closables.closeQuietly}} API has to be replaced; its 
 usage is a code smell, and that method is scheduled to be removed from Guava 
 16.0.
 See [this 
 discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
 for more info.



Re: 0.8 progress

2013-06-09 Thread Grant Ingersoll
I'm on M-1211 and 1247 (M-992 is related).  Will be on IRC for a few hours this 
morning.

-Grant

On Jun 9, 2013, at 1:48 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Working on M-833.
 
 From: Suneel Marthi suneel_mar...@yahoo.com
 To: dev@mahout.apache.org dev@mahout.apache.org 
 Sent: Saturday, June 8, 2013 6:09 PM
 Subject: Re: 0.8 progress
 
 I will be looking at M-833 and M-1030 tonight.
 
 I can get the initial limited functionality for M-884 as part of 0.8 release 
 by tomorrow. Thanks to Robin for reviewing.
 
 
 
 
 
 
 
 From: Grant Ingersoll gsing...@apache.org
 To: dev@mahout.apache.org 
 Sent: Saturday, June 8, 2013 5:09 PM
 Subject: Re: 0.8 progress
 
 
 I've got 1103 and 1126 close to done.  Should be in by tomorrow.
 
 On Jun 8, 2013, at 4:18 PM, Robin Anil robin.a...@gmail.com wrote:
 
  Down to 15.
  
  Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
  
  
  On Sat, Jun 8, 2013 at 12:30 PM, Suneel Marthi 
  suneel_mar...@yahoo.com wrote:
  
  I am done with M-1026.
  
  
  
  
  
  From: Grant Ingersoll gsing...@apache.org
  To: dev@mahout.apache.org
  Sent: Saturday, June 8, 2013 10:42 AM
  Subject: Re: 0.8 progress
  
  
  Hmm, JIRA seems to be down...
  
  1084 is in.  I'm pretty close to being done on 1103.
  
  I'm on #mahout on Freenode if anyone wants to coordinate, and will be
  there for the next 1 hour or so.
  
  On Jun 8, 2013, at 7:21 AM, Grant Ingersoll gsing...@apache.org wrote:
  
  We are down to 18 issues!  Let's keep cranking.
  
  I'm working on 1103 and 1084 at the moment.
  
  On Jun 6, 2013, at 12:00 PM, Grant Ingersoll gsing...@apache.org
  wrote:
  
  
  On Jun 6, 2013, at 12:12 PM, Sebastian Schelter 
  ssc.o...@googlemail.com wrote:
  
  Hi Grant,
  
  Here's my take:
  
  Will/Must be finished:
  M-944[include]
  
  ^ Committed.
  
  M-958 [include]
  M-975[include]
  M-1084 [include]
  M-1098  [include]
  M-1103 [include]
  M-1126[push if no one steps up]
  M-1147  [include]
  M-1211  [push if no one steps up]
  M-1233  [push if no one steps up]
  M-1241  [include]
  
  Can be pushed if no one steps up:
  M-627 [push if no one steps up]
  M-833 [push if no one steps up]
  M-1163 [push if no one steps up]
  M-1164[push if no one steps up]
  M-1243[include]
  M-992 [include]
  
  ^ Working on this now.
  
  M-996 [push if no one steps up]
  M-1067[include]
  
  Unsure:
  M-974 [push if no one steps up]
  M-1026 [push if no one steps up]
  M-1030 [unsure]
  
  
  On 06.06.2013 11:26, Grant Ingersoll wrote:
  Working from the link below, we are down to 22 issues.
  
  
  https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.8%22%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
  
  Here's my opinion (and only my opinion, please vote, change as you
  see fit) based on a cursory glance of the state of these as to what needs
  to be in the release and what can be pushed:
  
  Will/Must be finished:
  M-944
  M-958
  M-975
  M-1084
  M-1098
  M-1103
  M-1126
  M-1147
  M-1211
  M-1233
  M-1241
  
  Can be pushed if no one steps up:
  M-627
  M-833
  M-1163
  M-1164
  M-1243
  M-992
  M-996
  M-1067
  
  Unsure:
  M-974
  M-1026
  M-1030
  
  
  
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Commented] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679048#comment-13679048
 ] 

Grant Ingersoll commented on MAHOUT-1211:
-

Patch coming shortly based off of Suneel's original patch.  Would appreciate 
some eyeballs before committing.  I went with Sean's approach for readers and 
writers.  I think Dmitriy has a valid point, but perhaps we take it on a 
case-by-case basis to see if any harm comes out of quietly closing readers.
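
For readers following the discussion, the trade-off is whether a failure in close() may be ignored. The sketch below is not taken from the patch; it is a stdlib-only illustration, assuming a hypothetical helper close(Closeable, boolean) that swallows the IOException only when asked to. Swallowing is usually considered tolerable for readers, while for writers a failed close can mean lost data and is better propagated.

```java
import java.io.Closeable;
import java.io.IOException;

class CloseDemo {
    // Hypothetical helper: close a resource, optionally swallowing the IOException.
    static void close(Closeable c, boolean swallow) throws IOException {
        if (c == null) {
            return;
        }
        try {
            c.close();
        } catch (IOException e) {
            if (!swallow) {
                throw e; // writer-style: the caller must learn that close failed
            }
            // reader-style: nothing was being written, so the failure is ignored
        }
    }

    // A resource whose close() always fails, to exercise both paths.
    static Closeable failing() {
        return new Closeable() {
            @Override
            public void close() throws IOException {
                throw new IOException("flush failed");
            }
        };
    }

    // Returns true if closing the failing resource propagated an exception.
    static boolean closeThrows(boolean swallow) {
        try {
            close(failing(), swallow);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("reader-style close threw: " + closeThrows(true));
        System.out.println("writer-style close threw: " + closeThrows(false));
    }
}
```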

 Replace deprecated Closables.closeQuietly calls
 ---

 Key: MAHOUT-1211
 URL: https://issues.apache.org/jira/browse/MAHOUT-1211
 Project: Mahout
  Issue Type: Improvement
Reporter: Stevo Slavic
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1211.patch


 Deprecated Guava {{Closables.closeQuietly}} API has to be replaced; its 
 usage is a code smell, and that method is scheduled to be removed from Guava 
 16.0.
 See [this 
 discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
 for more info.



[jira] [Updated] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1211:


Attachment: MAHOUT-1211.patch

Updated patch to trunk

 Replace deprecated Closables.closeQuietly calls
 ---

 Key: MAHOUT-1211
 URL: https://issues.apache.org/jira/browse/MAHOUT-1211
 Project: Mahout
  Issue Type: Improvement
Reporter: Stevo Slavic
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1211.patch, MAHOUT-1211.patch


 Deprecated Guava {{Closables.closeQuietly}} API has to be replaced; its 
 usage is a code smell, and that method is scheduled to be removed from Guava 
 16.0.
 See [this 
 discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
 for more info.



Re: Random Errors

2013-06-09 Thread Grant Ingersoll
Tests run: 100, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.75 sec <<< FAILURE!
testViewSequentialAccessSparseVectorWritable {#1 seed=[34643F377C10C8B9:3D6AC6E0C554E86F]}(org.apache.mahout.math.VectorWritableTest)  Time elapsed: 0.423 sec  <<< ERROR!
com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from TEST scope at testViewSequentialAccessSparseVectorWritable {#1 seed=[34643F377C10C8B9:3D6AC6E0C554E86F]}(org.apache.mahout.math.VectorWritableTest):
   1) Thread[id=13, name=Thread-2, state=RUNNABLE, group=main]
    at com.apple.java.Application.getAppBundleIdNative(Native Method)
    at com.apple.java.Application.getAppBundleId(Application.java:19)
    at com.apple.java.Usage.performReport(Usage.java:52)
    at com.apple.java.Usage.performAfterDelay(Usage.java:27)
    at __randomizedtesting.SeedInfo.seed([34643F377C10C8B9:3D6AC6E0C554E86F]:0)


This may be a hint.  Don't get it when running it standalone...

On Jun 9, 2013, at 8:50 AM, Sebastian Schelter ssc.o...@googlemail.com wrote:

 I observe a similar behavior.
 
 On 09.06.2013 14:47, Grant Ingersoll wrote:
 I get a failure on the one below when running in parallel, but not 
 standalone: 
 
 Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 10.358 sec <<< FAILURE!
 testRun(org.apache.mahout.text.SequenceFilesFromLuceneStorageMRJobTest)  Time elapsed: 10.358 sec  <<< FAILURE!
 java.lang.AssertionError: expected:<2002> but was:<0>
     at org.junit.Assert.fail(Assert.java:88)
     at org.junit.Assert.failNotEquals(Assert.java:743)
     at org.junit.Assert.assertEquals(Assert.java:118)
     at org.junit.Assert.assertEquals(Assert.java:555)
     at org.junit.Assert.assertEquals(Assert.java:542)
     at org.apache.mahout.text.SequenceFilesFromLuceneStorageMRJobTest.testRun(SequenceFilesFromLuceneStorageMRJobTest.java:73)
 
 
 Interesting thing about this one is the Test class has only a single test 
 and it has no randomization.
 
 FWIW, it's also becoming increasingly clear to me that we need some notion 
 of real integration tests that we can run against a Hadoop cluster (or at 
 least a virtual Hadoop cluster).
 
 -Grant
 
 On Jun 8, 2013, at 9:38 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
 
 number generators. Where a test depends on a particular sequence, and
 somewhere an RNG doesn't use the RandomUtils trick, it may have a
 different state if other tests ran before.
 
 I have a different solution for this in randomizedtesting framework (a
 Random instance cannot be shared from test to test, it will throw an
 exception if you do share it). This doesn't solve all the possible
 problems but proved quite effective at catching test dependencies.
 
 The surefire parameter just controls what order the *classes* run in 
 AFAICT:
 http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#runOrder
 
 Yeah, I was on the train when I wrote that e-mail. The trick I
 remembered is in fact inside JUnit 4.11 and onwards --
 https://github.com/junit-team/junit/blob/master/doc/ReleaseNotes4.11.md#test-execution-order
 
 D.
 
 
 
 
 
 
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Commented] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679053#comment-13679053
 ] 

Grant Ingersoll commented on MAHOUT-1211:
-

I committed this.  We can leave it open for others to review and tweak, but it 
should be possible to close it before the release.

 Replace deprecated Closables.closeQuietly calls
 ---

 Key: MAHOUT-1211
 URL: https://issues.apache.org/jira/browse/MAHOUT-1211
 Project: Mahout
  Issue Type: Improvement
Reporter: Stevo Slavic
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1211.patch, MAHOUT-1211.patch


 Deprecated Guava {{Closables.closeQuietly}} API has to be replaced; its 
 usage is a code smell, and that method is scheduled to be removed from Guava 
 16.0.
 See [this 
 discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
 for more info.



[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679074#comment-13679074
 ] 

Grant Ingersoll commented on MAHOUT-1247:
-

Here's the first error I'm getting: https://paste.apache.org/cik6
{quote}
java.lang.IllegalStateException: /tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/4475940891381251304_1262960862_693852121/localhostdicVec/dictionary.file-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
    at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:146)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/4475940891381251304_1262960862_693852121/localhostdicVec/dictionary.file-0
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:796)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    ... 9 more
{quote}

Might be related to MAHOUT-992, but not sure.  I added a main to 
DictionaryVectorizer that allows you to reproduce this off of the prior run of 
cluster-reuters without having to go re-run everything.

 cluster-reuters doesn't work on Hadoop
 --

 Key: MAHOUT-1247
 URL: https://issues.apache.org/jira/browse/MAHOUT-1247
 Project: Mahout
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 0.8


 At least two issues:
 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
 2. The ExtractReuters data is not being moved to HDFS.



[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679076#comment-13679076
 ] 

Grant Ingersoll commented on MAHOUT-1247:
-

After you run cluster-reuters.sh, you can run:
{code}
bin/mahout org.apache.mahout.vectorizer.DictionaryVectorizer -i /tmp/mahout-work-grantingersoll/reuters-out-seqdir-sparse-kmeans/tokenized-documents -o ./dicVec
{code}

Make sure you have HADOOP_HOME set and also substitute in the appropriate work 
directory.

 cluster-reuters doesn't work on Hadoop
 --

 Key: MAHOUT-1247
 URL: https://issues.apache.org/jira/browse/MAHOUT-1247
 Project: Mahout
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 0.8


 At least two issues:
 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
 2. The ExtractReuters data is not being moved to HDFS.



[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679090#comment-13679090
 ] 

Grant Ingersoll commented on MAHOUT-1247:
-

I think I see the issue.  The cache file is local, the Iterator, however, has 
a Hadoop conf that is expecting an HDFS file, hence it can't find it.
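
The mismatch can be pictured as a scheme-less path being qualified against whichever filesystem is the default. The sketch below is an analogy using java.net.URI only; it does not call Hadoop, and DefaultFsDemo, qualify, and the shortened cache path are hypothetical. The hdfs://localhost:9000 authority is the one that appears in the stack trace on this issue.

```java
import java.net.URI;

class DefaultFsDemo {
    // Qualify a scheme-less path against a default filesystem URI (illustrative only).
    static URI qualify(String defaultFs, String path) {
        return URI.create(defaultFs).resolve(path);
    }

    public static void main(String[] args) {
        // Hypothetical, shortened stand-in for a local distributed-cache path.
        String cached = "/tmp/hadoop-user/mapred/local/distcache/dictionary.file-0";

        // With HDFS as the default filesystem, the local path is looked up on HDFS.
        System.out.println(qualify("hdfs://localhost:9000/", cached));

        // With the local filesystem as the default, the same path stays local.
        System.out.println(qualify("file:///", cached));
    }
}
```

Under this reading, the file exists on local disk, but the reader resolves it against HDFS and reports it missing, which matches the "File does not exist: hdfs://localhost:9000/tmp/..." message.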

 cluster-reuters doesn't work on Hadoop
 --

 Key: MAHOUT-1247
 URL: https://issues.apache.org/jira/browse/MAHOUT-1247
 Project: Mahout
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 0.8


 At least two issues:
 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
 2. The ExtractReuters data is not being moved to HDFS.



[jira] [Commented] (MAHOUT-975) Bug in Gradient Machine - Computation of the gradient

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679143#comment-13679143
 ] 

Grant Ingersoll commented on MAHOUT-975:


[~tdunning] Any chance this is getting in this week?

 Bug in Gradient Machine  - Computation of the gradient
 --

 Key: MAHOUT-975
 URL: https://issues.apache.org/jira/browse/MAHOUT-975
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.7
Reporter: Christian Herta
Assignee: Ted Dunning
 Fix For: 0.8

 Attachments: GradientMachine.patch


 The initialisation used to compute the gradient descent weight updates for 
 the output units appears to be wrong:
  
 In the comment: dy / dw is just w since  y = x' * w + b.
 This is wrong. dy/dw is x (ignoring the indices). The same initialisation is 
 done in the code.
 Check by using neural network terminology:
 The gradient machine is a specialized version of a multi layer perceptron 
 (MLP).
 In a MLP the gradient for computing the weight change for the output units 
 is:
 dE / dw_ij = dE / dz_i * dz_i / dw_ij with z_i = sum_j (w_ij * a_j)
 here: i index of the output layer; j index of the hidden layer
 (d stands for the partial derivatives)
 here: z_i = a_i (no squashing in the output layer)
 with the special loss (cost function) E = 1 - a_g + a_b = 1 - z_g + z_b
 with
 g: index of the output unit with target value +1 (positive class)
 b: random output unit with target value 0
 =>
 dE / dw_gj = dE/dz_g * dz_g/dw_gj = -1 * a_j (a_j: activity of the hidden 
 unit j)
 dE / dw_bj = dE/dz_b * dz_b/dw_bj = +1 * a_j (a_j: activity of the hidden 
 unit j)
 This matches what the comment would say if it were correct:
 dy/dw = x (x is here the activation of the hidden unit), multiplied by (-1) 
 for weights to the output unit with target value +1.
  
 In neural network implementations it's common to compute the gradient
 numerically as a test of the implementation. This can be done by:
 dE/dw_ij = (E(w_ij + epsilon) - E(w_ij - epsilon)) / (2 * epsilon)
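
The numerical check suggested at the end of the report can be sketched in a few lines. This is a minimal illustration, not Mahout's GradientMachine: the class and method names are hypothetical, the weights and activations are made-up values, the loss is the linear form E = 1 - z_g + z_b from the report with z_i = sum_j (w_ij * a_j), and g and b are assumed to be different output units.

```java
class GradientCheckDemo {
    static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int k = 0; k < x.length; k++) {
            sum += x[k] * y[k];
        }
        return sum;
    }

    // Loss from the report: E = 1 - z_g + z_b, with z_i = sum_j w[i][j] * a[j].
    static double loss(double[][] w, double[] a, int g, int b) {
        return 1.0 - dot(w[g], a) + dot(w[b], a);
    }

    // Central difference: (E(w_ij + eps) - E(w_ij - eps)) / (2 * eps).
    static double numericGradient(double[][] w, double[] a, int g, int b,
                                  int i, int j, double eps) {
        double saved = w[i][j];
        w[i][j] = saved + eps;
        double plus = loss(w, a, g, b);
        w[i][j] = saved - eps;
        double minus = loss(w, a, g, b);
        w[i][j] = saved; // restore the weight
        return (plus - minus) / (2.0 * eps);
    }

    // Analytic gradient: -a_j for row g, +a_j for row b, 0 for all other rows.
    static double analyticGradient(double[] a, int g, int b, int i, int j) {
        if (i == g) {
            return -a[j];
        }
        if (i == b) {
            return a[j];
        }
        return 0.0;
    }

    public static void main(String[] args) {
        double[][] w = {{0.2, -0.5, 0.1}, {0.7, 0.3, -0.4}, {-0.6, 0.9, 0.8}};
        double[] a = {1.5, -2.0, 0.5}; // hidden-unit activations
        int g = 0;
        int b = 2;
        double worst = 0.0;
        for (int i = 0; i < w.length; i++) {
            for (int j = 0; j < a.length; j++) {
                double diff = Math.abs(numericGradient(w, a, g, b, i, j, 1e-4)
                    - analyticGradient(a, g, b, i, j));
                worst = Math.max(worst, diff);
            }
        }
        System.out.println("largest |numeric - analytic| = " + worst);
    }
}
```

The gradient depends on the hidden activations a_j rather than on the weights, which is the point the report makes about the comment in the code.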



[jira] [Resolved] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1247.
-

Resolution: Fixed

Fixed by MAHOUT-992

 cluster-reuters doesn't work on Hadoop
 --

 Key: MAHOUT-1247
 URL: https://issues.apache.org/jira/browse/MAHOUT-1247
 Project: Mahout
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 0.8


 At least two issues:
 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
 2. The ExtractReuters data is not being moved to HDFS.


