Re: Hi ... need some help?
Chris, This is really nice work. On Wed, Apr 22, 2020 at 1:46 AM Christofer Dutz wrote: > Hi Andrew, > > thanks for your kind words ... they are sort of the fuel that makes me run > ;-) > > So some general observations and suggestions: > - You seem to use test-jars quite a bit: these are generally considered an > antipattern, as you possibly import problems from another module and you > will have no way of detecting them. If you need shared test code, it's > better practice to create a dedicated test-utils module and include that > wherever it's needed. > - Don't use variables for project dependencies: it makes things slightly > more difficult to read, the release plugin takes care of updating versions > for you, and some third-party plugins might have issues with it. > - I usually provide versions for all project dependencies and have all > other dependencies managed in a dependencyManagement section of the root > module; this avoids problems with version conflicts when constructing > something using multiple parts of your project (especially your lib > directory thing). > - Accessing resources outside of the current module's scope is generally > considered an antipattern ... regarding your lib thing, I would suggest an > assembly that builds a directory (but I do understand that this version > perhaps speeds up the development workflow ... we could move the clean > plugin configuration and the antrun plugin config into a profile dedicated > to development). > - I usually order the plugin configurations (as much as possible) the way > they are usually executed in the build ... so: clean, process resources, > compile, test, package, ... This makes it easier to understand the build in > general. > > Today I'll go through the poms again, managing all versions and cleaning up > the order of things. Then, if all still works, I would bump the dependency > versions up as much as possible. > > Will report back as soon as I'm through or I've got something to report ... 
> then I'll also go into details with your feedback (I haven't ignored it ;-) > ) > > Chris > > > > On 22.04.20, 06:08, "Andrew Palumbo" wrote: > > Fixing previous message.. > > > Quote from Chris Dutz: > > > Hi folks, > >so I was now able to build (including all tests) with Java 8 and > 9 ... currently trying 10 ... > >Are there any objections if some maven dependencies get updated > to more recent versions? I mean ... the hbase-client you're using is more > than 5 years old ... > > My answer: > > I personally have no problem with updating any dependencies; > they may break some things and cause more work, but that is the kind of > thing that we've been trying to get done in this build work, get > everything up to speed. > > I'd say take Andrew, Trevor and Pat's word over mine though, as I am a bit > less active presently. > > Thanks. > > Andy > > > From: Andrew Palumbo > Sent: Tuesday, April 21, 2020 10:17 PM > To: dev@mahout.apache.org > Subject: Re: Hi ... need some help? > > Hi folks, > > so I was now able to build (including all tests) with Java 8 and 9 > ... currently trying 10 ... > > Are there any objections if some maven dependencies get updated > to more recent versions? I mean ... the hbase-client you're using is more > than 5 years old ... > Not by me, I believe that is being used by the MR module, which is > deprecated. > > I personally have no problem with updating any dependencies; > they may break some things and cause more work, but that is the kind of > thing that we've been trying to get done in this build work, get > everything up to speed. > > I'd say take Andrew, Trevor and Pat's word over mine though, as I am a bit > less active presently. > > Thanks. > > Andy > > From: Andrew Palumbo > Sent: Tuesday, April 21, 2020 10:13 PM > To: dev@mahout.apache.org > Subject: Re: Hi ... need some help? > > Chris, Thank you so much for what you are doing, This is Apache at > its best.. 
I've been down and out with a serious illness, injury, and other > issues, which have seriously limited my machine time. I was pretty close > to getting a good build, but it was hacky, and the method that you use to > name the modules for both Scala versions looks great. > > We've always relied on Stevo to fix the builds for us, but as he said, he > is unable to contribute right now. The main issues (solved by hacks) > currently are: > > > 1. Dependencies and transitive dependencies are not being picked up > and copied to the `./lib` directory, where `/bin/mahout` and parts of the > MahoutSparkContext look for them, to add to the class path. So running > either from the CLI or as a library, dependencies are not picked up. > * We used to use the mahout-experimental-xx.jar as a fat jar > for this, though it was bloated with now
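The dependencyManagement advice in the message above can be sketched as a POM fragment. This is a hedged illustration, not Mahout's actual POM; the hbase-client coordinates are real, but the version shown is only an example:

```xml
<!-- Root pom.xml: third-party versions are pinned once in
     dependencyManagement, so child modules never repeat them. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>2.2.4</version><!-- example version, for illustration only -->
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- A child module then declares the dependency without a version;
     Maven resolves it from the root's dependencyManagement section. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
  </dependency>
</dependencies>
```

This keeps version conflicts out of downstream consumers that mix several modules of the project, which is the problem described above with the `./lib` directory assembly.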
Re: [jira] [Created] (MAHOUT-2094) Advanced Excel Training In Pune
I already took a look. I couldn't even delete the post. I did file an infra JIRA to block the poster. Anybody can file a similar JIRA to limit issue creation to committers. I would wait for a second occurrence, however. On Sat, Feb 29, 2020 at 12:29 PM Giorgio Zoppi wrote: > Ok, > ted, could you help on this? > BR, > Giorgio >
Re: [jira] [Created] (MAHOUT-2092) Machine Learning is an extensive area of Artificial Intelligence focused on the classical design
I removed the spammy content and asked infra to blacklist the poster. On Mon, Feb 24, 2020 at 2:59 AM Giorgio Zoppi wrote: > This should not be permitted. We don't care about ML courses; if we want a > course we look for ourselves. > BR, > Giorgio > > On Mon, Feb 24, 2020 at 11:45, Diksha Kakade (Jira) ( >) > wrote: > > > Diksha Kakade created MAHOUT-2092: > > - > > > > Summary: Machine Learning is an extensive area of Artificial > > Intelligence focused on the classical design > > Key: MAHOUT-2092 > > URL: https://issues.apache.org/jira/browse/MAHOUT-2092 > > Project: Mahout > > Issue Type: Blog - New Blog Request > > Components: Classification > > Affects Versions: 0.12.2 > > Reporter: Diksha Kakade > > Fix For: 14.2 > > Attachments: Machine_Learning_Header-compressor.jpg > > > > Master in Machine learning, Artificial Intelligence and Big Data workshop > > as part of their AI and Deep Learning training at SevenMentor training in > > Pune. Our Machine Learning course in Pune at SevenMentor, syllabus > > comprises the latest algorithms such as ANN, MLP RNN Autoencoders and > > moreover this app is considered to be the best Machine learning class in > > this region. There are a whole lot of amazing Artificial intelligence > > projects offered and nearly many of our candidates went to integrate with > > the fortune 100 firms. Students studying artificial intelligence training > > and Machine learning education, big data training are rigorously trained > > using live sector applicable case studies. What are you waiting for? > > Register now for the absolute best Machine Learning course in Pune at > > SevenMentor training pioneering your career into the AI companies and > learn > > the updated concepts. 
Re: MathJax not rendering on Website
This has happened periodically to my sites. The answer is usually that the canonical location of the MathJax JavaScript library has changed. On Sep 10, 2017 7:58 PM, "Andrew Palumbo" wrote: > It looks like MathJax is not rendering TeX on the site: > > > E.g.: > > > https://mahout.apache.org/users/algorithms/d-ssvd.html > > Ideas to get this going while the site is being redone? > > >
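When the canonical library location moves, the usual fix is to repoint the loader script. As a hedged illustration (this is the current MathJax 3 jsDelivr distribution, not necessarily what the 2017 site used), the page template would load MathJax like this:

```html
<!-- Configure inline-math delimiters before loading MathJax 3 -->
<script>
  window.MathJax = { tex: { inlineMath: [['$', '$'], ['\\(', '\\)']] } };
</script>
<!-- Load the combined TeX/MathML -> CHTML component from the CDN -->
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"
        async></script>
```

If the site pins an old CDN URL, swapping in a maintained location like the above is usually all that is needed.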
Re: Looking for help with a talk
Any time. Ping me directly. On Fri, Aug 4, 2017 at 1:12 AM, Isabel Drost-Fromm wrote: > Hi, > > I have a first draft of a narrative and slide deck. If anyone has time, it > would be lovely to bounce some ideas back and forth and have the draft of the > deck reviewed. > > > Isabel > >
Re: Unsubscribe.
Glad it worked. Sad to see you go. On Thu, Jun 8, 2017 at 4:24 AM, Roshan Kedar <rosbl...@gmail.com> wrote: > Hi Ted, > > Sorry to bother you but problem was @my end. I had sent mail to > unsubscribe but the "confirmation to unsubscribe" mail was sent to trash. > > Finally I did unsubscribe. > Thanks for support. > > Regards > Roshan Kedar > > On 8 Jun 2017 03:10, "Ted Dunning" <ted.dunn...@gmail.com> wrote: > >> >> Is there a chance you subscribed under another email address. >> >> >> >> On Wed, Jun 7, 2017 at 12:40 AM, Roshan Kedar <rosbl...@gmail.com> wrote: >> >>> Hahaha, >>> >>> Two, including today's mail after your reply. >>> >>> Actually your mails are overwhelming in number. But it was nice working >>> on >>> mahout. >>> >>> But now working on totally different field for some time. So please >>> unsubscribe. >>> >>> On 7 Jun 2017 03:07, "Trevor Grant" <trevor.d.gr...@gmail.com> wrote: >>> >>> > How many times have you sent an email to >>> dev-unsubscr...@mahout.apache.org >>> > ? >>> > >>> > On Tue, Jun 6, 2017 at 4:00 PM, Roshan Kedar <rosbl...@gmail.com> >>> wrote: >>> > >>> > > And exactly how many times I have to unsubscribe from this >>> newsletter? >>> > > >>> > > Unsubscribe me please. >>> > > >>> > >>> >> >>
Re: Unsubscribe.
Is there a chance you subscribed under another email address? On Wed, Jun 7, 2017 at 12:40 AM, Roshan Kedar wrote: > Hahaha, > > Two, including today's mail after your reply. > > Actually your mails are overwhelming in number. But it was nice working on > mahout. > > But now I am working in a totally different field for some time. So please > unsubscribe. > > On 7 Jun 2017 03:07, "Trevor Grant" wrote: > > > How many times have you sent an email to dev-unsubscribe@mahout.apache.org > > ? > > > > On Tue, Jun 6, 2017 at 4:00 PM, Roshan Kedar wrote: > > > > > And exactly how many times do I have to unsubscribe from this newsletter? > > > > > > Unsubscribe me please. > > > > > >
Re: New logo
On Sat, May 6, 2017 at 2:43 PM, Scott C. Cote wrote: > Will you be wearing “one of those t-shirts” on Monday in Houston :) ? > Not likely. It is in the archive.
Re: New logo
; >> > problems, and really statistics / "machine-learning" in general, in > >that > >> we > >> > can't find perfect solutions, yet we believe solutions exist and > >serve as > >> > our blueprint. > >> > > >> > Finally, I like that it is simple. > >> > > >> > Things I don't like about it: > >> > Lucent Technologies used it back in the 90s, however they used a > >very > >> > specific red one, and this isn't a deal breaker for me. > >> > > >> > Other thoughts: > >> > Based on the tattoo I saw- one could make an Enso using old mahout > >color > >> > palatte if one were to dab their brush in the appropriate colors. > >This > >> > could also be represented in any single color. (Not sure what that > >does > >> to > >> > our TM, is it ok if we just keep slapping TMs on the side of it? If > >that > >> is > >> > the case is there any reason we must have a single Enso?) > >> > > >> > So there is something to throw in the pot that is a little more > >grown up > >> > than my runner up favorites (honey badger, blueman riding bomb > >waving > >> > cowboy hat, blueman riding lighting bolt into a squirrel covered in > >> water, > >> > etc). > >> > > >> > Again, only know what wiki has told me, so if anyone is more > >familiar > >> with > >> > this symbol (like was it used as a logo by some horrible dictator > >which > >> > carried out terrible attrocities?) or just general comments. > >> > tg > >> > > >> > > >> > > >> > Trevor Grant > >> > Data Scientist > >> > https://github.com/rawkintrevo > >> > http://stackexchange.com/users/3002022/rawkintrevo > >> > http://trevorgrant.org > >> > > >> > *"Fortunate is he, who is able to know the causes of things." > >-Virgil* > >> > > >> > > >> > On Thu, Apr 27, 2017 at 5:50 PM, Ted Dunning > ><ted.dunn...@gmail.com> > >> wrote: > >> > > >> >> I don't have any constructive input at all. None of the proposals > >showed > >> >> any spark (to me). > >> >> > >> >> I hate it when I can't suggest a better path and I hate negative > >> feedback. 
> >> >> But there it is. > >> >> > >> >> > >> >> > >> >> On Thu, Apr 27, 2017 at 3:48 PM, Pat Ferrel > ><p...@occamsmachete.com> > >> wrote: > >> >> > >> >>> Do you have constructive input (guidance or opinion is welcome > >input) > >> or > >> >>> would you like to discontinue the contest. If the later, -1 now. > >> >>> > >> >>> > >> >>> On Apr 27, 2017, at 3:42 PM, Ted Dunning <ted.dunn...@gmail.com> > >> wrote: > >> >>> > >> >>> I thought that none of the proposals were worth continuing with. > >> >>> > >> >>> > >> >>> > >> >>> On Thu, Apr 27, 2017 at 3:36 PM, Pat Ferrel > ><p...@occamsmachete.com> > >> >> wrote: > >> >>> > >> >>>> Yes, -1 means you hate them all or think the designers are not > >worth > >> >>>> paying. We have to pay to continue, I’ll foot the bill > >(donations > >> >>>> appreciated) but don’t want to unless people think it will lead > >to > >> >>>> something. For me there are a couple I wouldn’t mind seeing on > >the web > >> >>> site > >> >>>> or swag and yes we do have time to try something completely > >different, > >> >>> and > >> >>>> the designers will be more willing since there is a guaranteed > >payout. > >> >>>> > >> >>>> > >> >>>> On Apr 27, 2017, at 3:30 PM, Andrew Musselman < > >> >>> andrew.mussel...@gmail.com> > >> >>>> wrote: > >> >>>> > >> >>>> I thought we were just voting on continuing this process :) > >> >>>> > >> >>>> On Thu, Apr 27, 2017 at 3:22 PM, Trevor Grant < > >> >> trevor.d.gr...@gmail.com> > >> >>>> w
Re: New logo
I haven't been active enough to feel good about an out and out -1. Put me as -0 On Thu, Apr 27, 2017 at 3:54 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > Fair enough, I think Trevor feels the same. > > The blue man can continue, all it takes is a -1 > > > On Apr 27, 2017, at 3:50 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > > I don't have any constructive input at all. None of the proposals showed > any spark (to me). > > I hate it when I can't suggest a better path and I hate negative feedback. > But there it is. > > > > On Thu, Apr 27, 2017 at 3:48 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > > > Do you have constructive input (guidance or opinion is welcome input) or > > would you like to discontinue the contest. If the later, -1 now. > > > > > > On Apr 27, 2017, at 3:42 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > > > > I thought that none of the proposals were worth continuing with. > > > > > > > > On Thu, Apr 27, 2017 at 3:36 PM, Pat Ferrel <p...@occamsmachete.com> > wrote: > > > >> Yes, -1 means you hate them all or think the designers are not worth > >> paying. We have to pay to continue, I’ll foot the bill (donations > >> appreciated) but don’t want to unless people think it will lead to > >> something. For me there are a couple I wouldn’t mind seeing on the web > > site > >> or swag and yes we do have time to try something completely different, > > and > >> the designers will be more willing since there is a guaranteed payout. > >> > >> > >> On Apr 27, 2017, at 3:30 PM, Andrew Musselman < > > andrew.mussel...@gmail.com> > >> wrote: > >> > >> I thought we were just voting on continuing this process :) > >> > >> On Thu, Apr 27, 2017 at 3:22 PM, Trevor Grant <trevor.d.gr...@gmail.com > > > >> wrote: > >> > >>> Also Pat, thank you for organizing. 
> >>> > >>> +0 > >>> > >>> I don't love any of them enough to +1, I don't hate them all enough to > > -1 > >>> > >>> Most of them remind me of some spin on Apache Apex, Python, Numpy (a > >> Python > >>> Library), or IBM's DSX. However, I realize a big part of that is the > >>> colors chosen. > >>> > >>> #143 is my favorite (possibly because it reminds me of none of the > >> above). > >>> But possibly if this goes to next round we can have them adjust hues / > >>> colors. > >>> > >>> Trevor Grant > >>> Data Scientist > >>> https://github.com/rawkintrevo > >>> http://stackexchange.com/users/3002022/rawkintrevo > >>> http://trevorgrant.org > >>> > >>> *"Fortunate is he, who is able to know the causes of things." -Virgil* > >>> > >>> > >>> On Thu, Apr 27, 2017 at 5:15 PM, Andrew Musselman < > >>> andrew.mussel...@gmail.com> wrote: > >>> > >>>> +1 to continue; thanks for organizing this Pat! > >>>> > >>>> My personal favorite is #38 > >>>> https://images-platform.99static.com/I9quDzcBrtJXg_ > >>> NMaIsH6ySQ7Ok=/filters: > >>>> quality(100)/99designs-contests-attachments/84/84017/ > >> attachment_84017937 > >>>> > >>>> I like the stylized and simple "M" and it reminds me of diagrams > > showing > >>>> vector multiplication. > >>>> > >>>> On Thu, Apr 27, 2017 at 12:56 PM, Pat Ferrel <p...@occamsmachete.com> > >>>> wrote: > >>>> > >>>>> We can treat this like a release vote, if anyone hates all these and > >>>>> doesn’t want to continue with shortlisted designers for 3 more days > >>> (the > >>>>> next step) vote -1 and say if your vote is binding (your are a PMC > >>>> member) > >>>>> > >>>>> Otherwise all are welcome to rate everything on the polls below. > >>>>> > >>>>> In this case you have 24 hours to vote > >>>>> > >>>>> Here’s my +1 to continue refining. > >>>>> > >>>>> > >>>>> On Apr 27, 2017, at 11:41 AM, Pat Ferrel <p...@occamsmachete.com> > >>> wrote: > >>>>> > >>>>> Here is a second group, hopefully picked to be unique. 
> >>>>> https://99designs.com/contests/poll/vl7xed > >>>>> > >>>>> We got a lot of responses, these 2 polls contain the best afaict. > >>>>> > >>>>> > >>>>> On Apr 27, 2017, at 11:25 AM, Pat Ferrel <p...@occamsmachete.com> > >>> wrote: > >>>>> > >>>>> Vote: https://99designs.com/contests/poll/rqcgif > >>>>> > >>>>> We asked for something “mathy” and asked for no elephant and rider. > We > >>>>> have the rest of the week to tweak so leave comments about what you > >>> like > >>>> or > >>>>> would like to change. > >>>>> > >>>>> We don’t have to pick one of these, so if you hate them all, make > that > >>>>> known too. > >>>>> > >>>>> > >>>>> > >>>> > >>> > >> > >> > > > > > >
Re: New logo
I don't have any constructive input at all. None of the proposals showed any spark (to me). I hate it when I can't suggest a better path and I hate negative feedback. But there it is. On Thu, Apr 27, 2017 at 3:48 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > Do you have constructive input (guidance or opinion is welcome input) or > would you like to discontinue the contest. If the later, -1 now. > > > On Apr 27, 2017, at 3:42 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > > I thought that none of the proposals were worth continuing with. > > > > On Thu, Apr 27, 2017 at 3:36 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > > > Yes, -1 means you hate them all or think the designers are not worth > > paying. We have to pay to continue, I’ll foot the bill (donations > > appreciated) but don’t want to unless people think it will lead to > > something. For me there are a couple I wouldn’t mind seeing on the web > site > > or swag and yes we do have time to try something completely different, > and > > the designers will be more willing since there is a guaranteed payout. > > > > > > On Apr 27, 2017, at 3:30 PM, Andrew Musselman < > andrew.mussel...@gmail.com> > > wrote: > > > > I thought we were just voting on continuing this process :) > > > > On Thu, Apr 27, 2017 at 3:22 PM, Trevor Grant <trevor.d.gr...@gmail.com> > > wrote: > > > >> Also Pat, thank you for organizing. > >> > >> +0 > >> > >> I don't love any of them enough to +1, I don't hate them all enough to > -1 > >> > >> Most of them remind me of some spin on Apache Apex, Python, Numpy (a > > Python > >> Library), or IBM's DSX. However, I realize a big part of that is the > >> colors chosen. > >> > >> #143 is my favorite (possibly because it reminds me of none of the > > above). > >> But possibly if this goes to next round we can have them adjust hues / > >> colors. 
> >> > >> Trevor Grant > >> Data Scientist > >> https://github.com/rawkintrevo > >> http://stackexchange.com/users/3002022/rawkintrevo > >> http://trevorgrant.org > >> > >> *"Fortunate is he, who is able to know the causes of things." -Virgil* > >> > >> > >> On Thu, Apr 27, 2017 at 5:15 PM, Andrew Musselman < > >> andrew.mussel...@gmail.com> wrote: > >> > >>> +1 to continue; thanks for organizing this Pat! > >>> > >>> My personal favorite is #38 > >>> https://images-platform.99static.com/I9quDzcBrtJXg_ > >> NMaIsH6ySQ7Ok=/filters: > >>> quality(100)/99designs-contests-attachments/84/84017/ > > attachment_84017937 > >>> > >>> I like the stylized and simple "M" and it reminds me of diagrams > showing > >>> vector multiplication. > >>> > >>> On Thu, Apr 27, 2017 at 12:56 PM, Pat Ferrel <p...@occamsmachete.com> > >>> wrote: > >>> > >>>> We can treat this like a release vote, if anyone hates all these and > >>>> doesn’t want to continue with shortlisted designers for 3 more days > >> (the > >>>> next step) vote -1 and say if your vote is binding (your are a PMC > >>> member) > >>>> > >>>> Otherwise all are welcome to rate everything on the polls below. > >>>> > >>>> In this case you have 24 hours to vote > >>>> > >>>> Here’s my +1 to continue refining. > >>>> > >>>> > >>>> On Apr 27, 2017, at 11:41 AM, Pat Ferrel <p...@occamsmachete.com> > >> wrote: > >>>> > >>>> Here is a second group, hopefully picked to be unique. > >>>> https://99designs.com/contests/poll/vl7xed > >>>> > >>>> We got a lot of responses, these 2 polls contain the best afaict. > >>>> > >>>> > >>>> On Apr 27, 2017, at 11:25 AM, Pat Ferrel <p...@occamsmachete.com> > >> wrote: > >>>> > >>>> Vote: https://99designs.com/contests/poll/rqcgif > >>>> > >>>> We asked for something “mathy” and asked for no elephant and rider. We > >>>> have the rest of the week to tweak so leave comments about what you > >> like > >>> or > >>>> would like to change. 
> >>>> > >>>> We don’t have to pick one of these, so if you hate them all, make that > >>>> known too. > >>>> > >>>> > >>>> > >>> > >> > > > > > >
Re: New logo
I thought that none of the proposals were worth continuing with. On Thu, Apr 27, 2017 at 3:36 PM, Pat Ferrelwrote: > Yes, -1 means you hate them all or think the designers are not worth > paying. We have to pay to continue, I’ll foot the bill (donations > appreciated) but don’t want to unless people think it will lead to > something. For me there are a couple I wouldn’t mind seeing on the web site > or swag and yes we do have time to try something completely different, and > the designers will be more willing since there is a guaranteed payout. > > > On Apr 27, 2017, at 3:30 PM, Andrew Musselman > wrote: > > I thought we were just voting on continuing this process :) > > On Thu, Apr 27, 2017 at 3:22 PM, Trevor Grant > wrote: > > > Also Pat, thank you for organizing. > > > > +0 > > > > I don't love any of them enough to +1, I don't hate them all enough to -1 > > > > Most of them remind me of some spin on Apache Apex, Python, Numpy (a > Python > > Library), or IBM's DSX. However, I realize a big part of that is the > > colors chosen. > > > > #143 is my favorite (possibly because it reminds me of none of the > above). > > But possibly if this goes to next round we can have them adjust hues / > > colors. > > > > Trevor Grant > > Data Scientist > > https://github.com/rawkintrevo > > http://stackexchange.com/users/3002022/rawkintrevo > > http://trevorgrant.org > > > > *"Fortunate is he, who is able to know the causes of things." -Virgil* > > > > > > On Thu, Apr 27, 2017 at 5:15 PM, Andrew Musselman < > > andrew.mussel...@gmail.com> wrote: > > > >> +1 to continue; thanks for organizing this Pat! > >> > >> My personal favorite is #38 > >> https://images-platform.99static.com/I9quDzcBrtJXg_ > > NMaIsH6ySQ7Ok=/filters: > >> quality(100)/99designs-contests-attachments/84/84017/ > attachment_84017937 > >> > >> I like the stylized and simple "M" and it reminds me of diagrams showing > >> vector multiplication. 
> >> > >> On Thu, Apr 27, 2017 at 12:56 PM, Pat Ferrel > >> wrote: > >> > >>> We can treat this like a release vote, if anyone hates all these and > >>> doesn’t want to continue with shortlisted designers for 3 more days > > (the > >>> next step) vote -1 and say if your vote is binding (your are a PMC > >> member) > >>> > >>> Otherwise all are welcome to rate everything on the polls below. > >>> > >>> In this case you have 24 hours to vote > >>> > >>> Here’s my +1 to continue refining. > >>> > >>> > >>> On Apr 27, 2017, at 11:41 AM, Pat Ferrel > > wrote: > >>> > >>> Here is a second group, hopefully picked to be unique. > >>> https://99designs.com/contests/poll/vl7xed > >>> > >>> We got a lot of responses, these 2 polls contain the best afaict. > >>> > >>> > >>> On Apr 27, 2017, at 11:25 AM, Pat Ferrel > > wrote: > >>> > >>> Vote: https://99designs.com/contests/poll/rqcgif > >>> > >>> We asked for something “mathy” and asked for no elephant and rider. We > >>> have the rest of the week to tweak so leave comments about what you > > like > >> or > >>> would like to change. > >>> > >>> We don’t have to pick one of these, so if you hate them all, make that > >>> known too. > >>> > >>> > >>> > >> > > > >
Re: Marketing
On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel wrote: > maybe we should drop the name Mahout altogether. I have been told that there is a cool secondary interpretation of Mahout as well. I think that the Hebrew word is pronounced roughly like Mahout. מַהוּת The cool thing is that this word means "essence" or possibly "truth". So regardless of the guy riding the elephant, Mahout still has something to be said for it. (I have no Hebrew, btw) (real speakers may want to comment here)
Re: LLR thresholds
MAP is dangerous, as are all off-line comparisons. The problem is that it tends to over-emphasize precision over recall, and it tends to emphasize replicating what has been seen before. Increasing the threshold increases precision and decreases recall. But MAP mostly only cares about the top hit. In practice, you want lots of good hits in the results page. On Wed, Mar 8, 2017 at 8:18 AM, Pat Ferrel wrote: > The CCO algorithm now supports a couple of ways to limit indicators by > “quality". The new way is by the value of LLR. We built a t-digest > mechanism to look at the overall density produced with different > thresholds. The higher the threshold, the lower the number of indicators > and the lower the density of the resulting indicator matrix, but also the > higher the MAP score (of the full recommender). So MAP seems to increase > monotonically until it breaks down. > > This didn’t match my understanding of LLR, which is actually a test for > non-correlation. I was expecting high scores to mean a high likelihood of > non-correlation. So the actual formulation of the code must be reversing > that, so the higher the score, the higher the likelihood that non-correlation > is **false** (this is treated as evidence of correlation). > > The next observation is that with high thresholds we get higher MAP scores > from the recommender (expected), but this increases monotonically until it > breaks down because there are so few indicators left. This leads us to the > conclusion that MAP is not a good way to set the threshold. We tried > looking at precision (MAP) vs recall (number of people who get recs) and > this gave ambiguous results with the data we had. > > Given my questions about how LLR is actually formulated in Mahout, I’m > unsure how to convert it into something like a confidence score or some > other way to judge the threshold that would lead to a good way to choose a > threshold. 
Any ideas or illumination about how it’s being calculated or how > to judge the threshold? > > > > Long description of motivation: > > LLR thresholds are needed when comparing conversion events to things that > have very small dimensionality, so maxIndicatorsPerItem does not work well. > For example, a location by state where there are 50: maxIndicatorsPerItem > defaults to 50, so you may end up with 50 very weak indicators. If there are > strong indicators in the data, thresholds should be the way to find them. > This might lead to a few per item if the data supports it, and this should > then be useful. The question above is how to choose a threshold. >
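For context on the question of how the score is formulated: the LLR used here is Dunning's G² test over the 2x2 contingency table of occurrence counts. It is near zero under independence and grows as the observed co-occurrence deviates from independence, which is why a high score reads as evidence *against* non-correlation, matching the observation above. A self-contained sketch of that computation (method names are illustrative; Mahout's own implementation lives in its LogLikelihood class):

```java
// Sketch of the 2x2 log-likelihood ratio (G^2) score, in the entropy
// formulation: LLR = 2 * (H(rows) + H(cols) - H(matrix)), where H is the
// "unnormalized entropy" sum*log(sum) - sum(x*log(x)).
public class Llr {
    // Unnormalized entropy of a set of counts; zero counts contribute nothing.
    static double entropy(double... counts) {
        double sum = 0.0, xLogX = 0.0;
        for (double c : counts) {
            sum += c;
            if (c > 0) xLogX += c * Math.log(c);
        }
        return (sum > 0 ? sum * Math.log(sum) : 0.0) - xLogX;
    }

    // k11: both events, k12/k21: one event only, k22: neither event
    static double logLikelihoodRatio(double k11, double k12, double k21, double k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + colEntropy - matEntropy);
    }

    public static void main(String[] args) {
        // Independent counts score ~0; a perfectly associated table scores high.
        System.out.println(Llr.logLikelihoodRatio(10, 10, 10, 10)); // ~0.0
        System.out.println(Llr.logLikelihoodRatio(1, 0, 0, 1));     // 4*ln(2) ~ 2.77
    }
}
```

Since G² is asymptotically chi-squared with one degree of freedom, one hedged way to pick a threshold is to invert a chi-squared tail probability rather than tune against MAP, which speaks to the confidence-score question above.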
[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408387#comment-15408387 ] Ted Dunning commented on MAHOUT-1853: - [~pferrel] Computing the parameters of a normal distribution is definitely cheaper than updating a t-digest, but I doubt that the difference will be visible. It takes a few additions and divisions to update the mean and sd, while it takes 100-200ns on average to update a t-digest with a new sample. But the big win happens when the data being collected is grossly non-normal, or when the stuff of interest is an anomalous tail in an otherwise normal distribution. Both of these cases apply in this situation. > Improvements to CCO (Correlated Cross-Occurrence) > - > > Key: MAHOUT-1853 > URL: https://issues.apache.org/jira/browse/MAHOUT-1853 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.12.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold > calculation for LLR downsampling, and possible multiple fixed thresholds for > A’A, A’B etc. This is to account for the vast difference in dimensionality > between indicator types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
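The "few additions and divisions to update the mean and sd" mentioned in the comment can be made concrete with Welford's online algorithm, sketched below. The class name is illustrative (this is not Mahout code); as the comment notes, a t-digest is the better tool when the distribution is grossly non-normal or the interesting signal is an anomalous tail:

```java
// Welford's online algorithm: numerically stable running mean and
// standard deviation, updated with a few additions and one division.
public class RunningStats {
    private long n;
    private double mean;
    private double m2;   // sum of squared deviations from the running mean

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;            // incremental mean update
        m2 += delta * (x - mean);     // uses both old and new mean
    }

    public double getMean() { return mean; }

    // Sample standard deviation; 0 until at least two samples are seen.
    public double getSd() {
        return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
    }

    public static void main(String[] args) {
        RunningStats s = new RunningStats();
        for (double x : new double[] {1, 2, 3, 4, 5}) s.add(x);
        System.out.println(s.getMean() + " " + s.getSd()); // 3.0, sqrt(2.5)
    }
}
```

Per sample this is constant work with no allocation, which is why it is cheaper than a t-digest update, though as the comment says the difference is unlikely to be visible in practice.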
Re: LLR quick clarification
It just means that there is an association. Causation is much more difficult to ascertain. On Wed, May 4, 2016 at 6:06 AM, Nikaash Puri wrote: > Hi, > > Just wanted to clarify a small doubt. On running LLR with primary > indicator as view and secondary indicator as purchase. Say, one line of the > cross-cooccurrence matrix looks as follows: > > view-purchase cross-cooccurrence matrix: > > I1 I2:0.9, I3:0.8, …….. > … > > This, in very simple terms then means that purchasing I2 should lead to > the recommendation of viewing I1, is that correct? Of course, ignoring the > other indicators for now. > > Thank you, > Nikaash Puri
Re: About reuters-fkmeans-centroids
On Thu, Apr 28, 2016 at 10:54 AM, Prakash Poudyal wrote: > Actually, I need to use fuzzy clustering to cluster the sentences in my > research. I found the fuzzy k-means clustering algorithm in Apache Mahout, thus, I > am trying to use it for my purpose. > That's great. But that code is no longer supported.
Re: [jira] [Created] (MAHOUT-1771) Cluster dumper omits indices and 0 elements for dense vectors
On Tue, Sep 8, 2015 at 1:38 AM, Sean Owen (JIRA) wrote: > Sean Owen created MAHOUT-1771: > - > > Summary: Cluster dumper omits indices and 0 elements for > dense vectors > Key: MAHOUT-1771 > URL: https://issues.apache.org/jira/browse/MAHOUT-1771 > Project: Mahout > Issue Type: Bug > Components: Clustering, mrlegacy > Affects Versions: 0.9 > Reporter: Sean Owen > Priority: Minor > > > Blast from the past -- are patches still being accepted for "mrlegacy" > code? Something turned up incidentally when working with a customer that > looks like a minor bug in the cluster dumper code. > > In {{AbstractCluster.java}}: > > {code} > public static List<Object> formatVectorAsJson(Vector v, String[] bindings) > throws IOException { > > boolean hasBindings = bindings != null; > boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != > v.size(); > > // we assume sequential access in the output > Vector provider = v.isSequentialAccess() ? v : new > SequentialAccessSparseVector(v); > > List<Object> terms = new LinkedList<>(); > String term = ""; > > for (Element elem : provider.nonZeroes()) { > > if (hasBindings && bindings.length >= elem.index() + 1 && > bindings[elem.index()] != null) { > term = bindings[elem.index()]; > } else if (hasBindings || isSparse) { > term = String.valueOf(elem.index()); > } > > Map<String, Object> term_entry = new HashMap<>(); > double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000; > if (hasBindings || isSparse) { > term_entry.put(term, roundedWeight); > terms.add(term_entry); > } else { > terms.add(roundedWeight); > } > } > > return terms; > } > {code} > > Imagine a {{DenseVector}} with 5 elements, of which two are 0. It's > considered dense in this method since the number of non-default elements is > 5 (all elements are "non default" in a dense vector). > > However the iteration is over non-zero elements only. And indices are only > printed if it's sparse (or has bindings). So the result will be the 3 > non-zero elements printed without indices.
Which dimensions they are can't > be determined. > > The fix seems to be either: > - Compare number of _non-zero_ elements to the size when determining if > it's sparse > - Iterate over all elements if non-sparse > > I think the first is the intent? it would be a one-line change if so. > > {code} > boolean isSparse = !v.isDense() && v.getNumZeroElements() != v.size(); > {code} > > Pretty straightforward, and minor, but wanted to check with everyone > before making a change. > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332) >
Re: Announcements
Can you set up a list of twitter handles? On Wed, Aug 19, 2015 at 1:11 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Thanks; I'd say me, Suneel, Pat, Andrew P, Dmitriy, and Stevo could use it during release time. On Monday, August 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: Not sure if Ellen will see this email. I will forward. She is happy to share access to the Twitter account via Tweetdeck to anybody that the PMC designates. On Mon, Aug 17, 2015 at 4:05 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Could we send out announcements through the @ApacheMahout account on Twitter? Ellen, if you need some help with that account let us know; we can make it part of the release process if we all have access to the handle. Thanks!
Re: Announcements
Not sure if Ellen will see this email. I will forward. She is happy to share access to the Twitter account via Tweetdeck to anybody that the PMC designates. On Mon, Aug 17, 2015 at 4:05 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Could we send out announcements through the @ApacheMahout account on Twitter? Ellen, if you need some help with that account let us know; we can make it part of the release process if we all have access to the handle. Thanks!
Re: July Board Report
On Sun, Jul 5, 2015 at 11:48 AM, Suneel Marthi smar...@apache.org wrote: "Off late" Minor typo: this should be "Of late".
Re: July Board Report
On Sat, Jul 4, 2015 at 12:03 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: (1) does Samsara as code name require trademark research, legally? If so, was any research done? (I am guessing not -- not thru apache legal anyway). If Samsara were a project name, it would require research. Since it will be used with the qualifier Apache Mahout or Mahout in practice and since those already qualify, there should be no need for another search.
Re: July Board Report
I think that there should be some commentary added to the report that the project has lately had a problem with a substantial amount of off-list design discussions and that the PMC is aware of the problem and working to fix the problem. At least, I think that the PMC is aware of the problem and is working to fix it. On Sat, Jul 4, 2015 at 10:18 AM, Suneel Marthi smar...@apache.org wrote: Below is the draft of the July Board report, feedback welcome. - Report from the Apache Mahout project ## Description: The goal of the Apache Mahout project is to build an environment for quickly creating scalable distributed machine learning algorithms. ## Activity: - Apache Mahout’s next generation 0.10.0 was released on April 11, 2015. A new Math environment called ‘Samsara’ for its theme of universal rejuvenation was introduced in the 0.10.0 release. At Samsara’s core are general linear algebra and statistical operations with supporting data structures. Mahout-Samsara reflects a rethinking of how scalable Machine Learning algorithms are to be built and customized. - Apache Mahout 0.10.1 was released on May 31, 2015. This was a minor bug fix release following 0.10.0. - Apache Mahout now supports scalable Machine Learning on Spark, H2O and MapReduce. - The project has been working closely with Apache BigTop to integrate Apache Mahout into BigTop following a release. - Integration of Apache Mahout with Apache Flink is in the works and is being done in collaboration with Data Artisans and TU Berlin. - Ted Dunning and Suneel Marthi announced the new Mahout 0.10.0 with Spark and H2O support at BigData Everywhere (BDE) DC Conference at Tysons Corner, VA on May 13, 2015 - Anand Avati was added as a new committer. - Stevo Slavic was added as a PMC member. - Team presently working on 0.10.2 release, tentatively planned for the week of July 10 2015. ## Issues: - None ## PMC/Committership changes: - Currently 25 committers and 14 PMC members in the project.
- Stevo Slavić was added to the PMC on Fri May 08 2015 - Anand Avati was added as a committer on Thu Apr 23 2015 ## Releases: - 0.10.1 was released on Sun May 31 2015 - 0.10.0 was released on Sat Apr 11 2015 ## Mailing list activity: - dev@mahout.apache.org: - 977 subscribers (down -8 in the last 3 months): - 1324 emails sent to list (1419 in previous quarter) - u...@mahout.apache.org: - 1933 subscribers (down -10 in the last 3 months): - 243 emails sent to list (252 in previous quarter) - gene...@mahout.apache.org: - 10 subscribers (up 0 in the last 3 months): - 0 emails sent to list (0 in previous quarter) ## JIRA activity: - 85 JIRA tickets created in the last 3 months - 74 JIRA tickets closed/resolved in the last 3 months
[jira] [Commented] (MAHOUT-1746) Fix: mxA ^ 2, mxA ^ 0.5 to mean the same thing as mxA * mxA and mxA ::= sqrt _
[ https://issues.apache.org/jira/browse/MAHOUT-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599735#comment-14599735 ] Ted Dunning commented on MAHOUT-1746: - I think that this is more complicated than it looks. I just wrote a test and got really strange results. The rate at which x*x != Math.pow(x,2) is not constant in my test and seems like there may be strange interactions with the JIT. Fix: mxA ^ 2, mxA ^ 0.5 to mean the same thing as mxA * mxA and mxA ::= sqrt _ -- Key: MAHOUT-1746 URL: https://issues.apache.org/jira/browse/MAHOUT-1746 Project: Mahout Issue Type: Blog - New Blog Request Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 0.10.2 it so happens that in java, if x is of double type, Math.pow(x,2.0) and x * x produce different values approximately once in million random values. This is extremely annoying as it creates rounding errors, especially with things like euclidean distance computations, which eventually may produce occasional NaNs. This issue suggests to get special treatment on vector and matrix dsl to make sure identical fpu algorithms are running as follows: x ^ 2 = x * x x ^ 0.5 = sqrt(x) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
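A harness in the spirit of the test Ted describes, counting bitwise disagreements between x * x and Math.pow(x, 2.0) over random doubles (the mismatch rate, if any, is JVM- and JIT-dependent, which is exactly the strangeness noted above, so no fixed rate is asserted):

```java
import java.util.Random;

// Counts values where multiplication and Math.pow give different bit
// patterns; Math.pow is only specified to within 1 ulp of the exact
// result, so the two expressions are not required to agree.
public class PowVsMul {
  public static long countMismatches(long trials, long seed) {
    Random r = new Random(seed);
    long mismatches = 0;
    for (long i = 0; i < trials; i++) {
      double x = r.nextDouble() * 1000.0;
      if (x * x != Math.pow(x, 2.0)) mismatches++;
    }
    return mismatches;
  }

  public static void main(String[] args) {
    long trials = 1_000_000;
    System.out.println(countMismatches(trials, 42)
        + " mismatches in " + trials + " trials");
  }
}
```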
Re: RDMA on apache mahout
Pejman, Not sure quite what you are asking. How is implementing RDMA in Mahout different than adding RDMA to Spark or H2O (the backends that Mahout uses)? On Tue, Jun 23, 2015 at 4:54 AM, Pejman Hosseini pejman.invincibl...@gmail.com wrote: Hello everybody! I want to implement RDMA on Mahout as a part of my Thesis inspired by Accelerating Big Data Processing with Hadoop, Spark, and Memcached on Datacenters with Modern Architectures http://www.cse.ohio-state.edu/%7Epanda/isca15_bigdata.pdf. Unfortunately I can't find any papers or references that explain or implement it. I wanted to know whether it is possible at all? -- *Seyyed Pejman Hosseini pejman.invincibl...@gmail.com*
Re: JIRA's with no commits
On Thu, Jun 18, 2015 at 7:08 AM, Suneel Marthi smar...@apache.org wrote: Agreed. We have been keeping all project and design discussions to dev@ mailing lists and that's still the case. I just took a look at Slack and there is a long conversation on general about the trade-offs of matrix algorithms. Then another about the benefits or costs of multi-backend architecture. These are not discussions about release coordination. They are design discussions.
Re: JIRA's with no commits
Slack isn't the mailing list. On Wed, Jun 17, 2015 at 11:43 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: We talked about it a lot, some on Slack; it was work finally approved for donation. I reviewed it, looked great. On Wednesday, June 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: 5k lines in a single commit? No discussion on the list? On Wed, Jun 17, 2015 at 11:26 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Sounds like part of PR 135 which is Dmitriy's 5k-line-diff drop from the other week. On Wednesday, June 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: A lot of JIRA's are being opened and then closed with no apparent commits associated with them. For example MAHOUT-1725 adds an element-wise power operation but it was closed as fixed with no apparent discussion and with no commits attached to the JIRA. What is happening?
Re: JIRA's with no commits
5k lines in a single commit? No discussion on the list? On Wed, Jun 17, 2015 at 11:26 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Sounds like part of PR 135 which is Dmitriy's 5k-line-diff drop from the other week. On Wednesday, June 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: A lot of JIRA's are being opened and then closed with no apparent commits associated with them. For example MAHOUT-1725 adds an element-wise power operation but it was closed as fixed with no apparent discussion and with no commits attached to the JIRA. What is happening?
Re: JIRA's with no commits
On Thu, Jun 18, 2015 at 12:36 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Capturing discussion in a public format and archiving the discussion would be preferable to fragmenting across lists, PR comments, and Slack, but the tools are all valuable, and until we find a way to build a digest for the archives I support using them all. Actually, capturing the design discussion on the list is not just preferable. It is required. Using alternative tools is fine and all, but not if it compromises that core requirement.
JIRA's with no commits
A lot of JIRA's are being opened and then closed with no apparent commits associated with them. For example MAHOUT-1725 adds an element-wise power operation but it was closed as fixed with no apparent discussion and with no commits attached to the JIRA. What is happening?
[jira] [Commented] (MAHOUT-1699) Trim down Mahout packaging for next release
[ https://issues.apache.org/jira/browse/MAHOUT-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533567#comment-14533567 ] Ted Dunning commented on MAHOUT-1699: - How many of these dependencies should actually just be put into provided scope and thus excluded from the jar entirely? Trim down Mahout packaging for next release --- Key: MAHOUT-1699 URL: https://issues.apache.org/jira/browse/MAHOUT-1699 Project: Mahout Issue Type: Improvement Components: build Affects Versions: 0.10.0 Reporter: Suneel Marthi Priority: Critical Fix For: 0.10.1 Mahout 0.10.0 package size is 210MB, this needs to be trimmed down to a more manageable size. This also makes it hard to package Mahout into the BigTop distro and not to mention seeking an infra waiver at the time of release for the 200MB size. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
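As a hedged illustration of Ted's suggestion (artifact and version chosen for example only), a `provided`-scope dependency is compiled against but left out of the shipped jar, on the assumption that the runtime environment supplies it:

```xml
<!-- Illustrative only: "provided" scope keeps a jar that the cluster
     already ships (e.g. Hadoop) out of the Mahout packaging. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.6.0</version>
  <scope>provided</scope>
</dependency>
```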
Re: Just noticed that web sites can be git based
There is also a proposal afoot to withdraw some of the CMS service. The pubsub service that publishes the html would remain. On Wed, May 6, 2015 at 3:40 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: The markup and publish process is what I wonder about; the current CMS may be klunky but it does work and provide staging and checkpointing. On Wednesday, May 6, 2015, Pat Ferrel p...@occamsmachete.com wrote: https://docs.prediction.io/resources/intellij/ Notice the blue edit button, bottom right. All it does is take you to the page on github but hitting edit there leads you through editing and creates the correct PR to their “livedocs” branch. No idea what their publish process is, but with a PR it seems like we can do a merge to the ASF git repo and get it published through the ASF process. On May 5, 2015, at 10:25 AM, Ted Dunning ted.dunn...@gmail.com wrote: Can you give a pointer to such an icon? On Tue, May 5, 2015 at 6:16 PM, Pat Ferrel p...@occamsmachete.com wrote: I asked to sign us up when this was first announced but haven’t heard back. On another project I hit an “edit” icon on their site, which automatically sent me to the page on github, where I was allowed to edit. This automatically created a branch in my repo and a pr to the correct branch of their repo. Very convenient. That way an edit icon can be put on every Mahout CMS page and users will find requesting some rewording quite easy. Notice that no write access is required since edits go through a PR. Not sure if the ASF implementation does this, but would be nice. On May 3, 2015, at 9:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: https://blogs.apache.org/infra/entry/git_based_websites_available This might be nice to get rid of the svn step in web site updates. It would involve an alternative workflow for updates rather than the CMS process.
Re: Just noticed that web sites can be git based
Can you give a pointer to such an icon? On Tue, May 5, 2015 at 6:16 PM, Pat Ferrel p...@occamsmachete.com wrote: I asked to sign us up when this was first announced but haven’t heard back. On another project I hit an “edit” icon on their site, which automatically sent me to the page on github, where I was allowed to edit. This automatically created a branch in my repo and a pr to the correct branch of their repo. Very convenient. That way an edit icon can be put on every Mahout CMS page and users will find requesting some rewording quite easy. Notice that no write access is required since edits go through a PR. Not sure if the ASF implementation does this, but would be nice. On May 3, 2015, at 9:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: https://blogs.apache.org/infra/entry/git_based_websites_available This might be nice to get rid of the svn step in web site updates. It would involve an alternative workflow for updates rather than the CMS process.
Just noticed that web sites can be git based
https://blogs.apache.org/infra/entry/git_based_websites_available This might be nice to get rid of the svn step in web site updates. It would involve an alternative workflow for updates rather than the CMS process.
Re: dependency-reduced jar
The support commitment for t-digest either via stream-lib or directly from the t-digest jar is the same. I support it. Stream-lib is a bit behind because they don't update the dependency as often. Otherwise, it is exactly the same software and exactly the same support. On Sat, May 2, 2015 at 2:41 PM, Andrew Palumbo ap@outlook.com wrote: On 05/02/2015 10:48 AM, Pat Ferrel wrote: Not removing Guava or any other dependencies from the jar. I don’t have time right now to fix all those Preconditions that might allow Guava to be removed and the other classes are needed by various Spark client code. +1 to dealing with the Guava precondition and assembly stuff in another issue. Again, I propose we factor this into client and worker jars. Removing Preconditions may allow us to do away with the Worker jar altogether since guava is not used in Scala now. On May 1, 2015, at 2:18 PM, Pat Ferrel p...@occamsmachete.com wrote: removing guava shows up a bunch of uses of google Preconditions in math. Guess I’ll have to remove those. I’ll leave mr and the rest alone since only math code gets run on a spark worker. On May 1, 2015, at 10:01 AM, Andrew Palumbo ap@outlook.com wrote: ResultAnalyzer is also used in SparkNaiveBayes.test (...). Sent from my Verizon Wireless 4G LTE smartphone -------- Original message -------- From: Andrew Palumbo ap@outlook.com Date: 05/01/2015 12:57 PM (GMT-05:00) To: dev@mahout.apache.org Subject: RE: dependency-reduced jar I added T-digest and math3. the CLI Naive Bayes driver needs them. Specifically the ResultAnalyzer in TestNBDriver. Sent from my Verizon Wireless 4G LTE smartphone -------- Original message -------- From: Suneel Marthi suneel.mar...@gmail.com Date: 05/01/2015 12:14 PM (GMT-05:00) To: mahout dev@mahout.apache.org Subject: Re: dependency-reduced jar T-digest is being used in Mahout-MR, I believe its also packaged as part of Spark - AddThis jar.
On Fri, May 1, 2015 at 12:11 PM, Pat Ferrel p...@occamsmachete.com wrote: There is an assembly xml in mahout/spark/src/main/assembly/dependency-reduced.xml. It contains dependencies that are external to mahout but required for either the client or backend executor distributed code. Guava has recently been removed but scopt is still used by the client. For some reason the following artifacts were added to the assembly and I’m not sure why. This is only used with Spark.
Re: bringing back the fp-growth code in mahout
On Mon, Apr 27, 2015 at 8:13 PM, ray rtmel...@gmail.com wrote: What is the best way to tell if Apache code is being maintained, in particular the fp-growth algorithm in Spark's MLlib? Ask on the appropriate mailing list.
Re: bringing back the fp-growth code in mahout
Ray, Is the Spark implementation usable? Is it maintained? If not, there is a decent reason to move forward. I don't think that we want to revive the old map-reduce implementation. On Mon, Apr 27, 2015 at 5:48 AM, ray rtmel...@gmail.com wrote: I had it in mind to volunteer to maintain the fp-growth code in Mahout, but I see that Spark has an fp-growth implementation. So now that I have the time to work on this, I'm wondering if there is any point, or if there is still any interest in the Mahout community. If not, so be it. If so, I volunteer. Regards, Ray.
Re: [jira] [Created] (BIGTOP-1831) Upgrade Mahout to 0.10
Yeah... things have changed pretty radically. There is a whole bunch of new Scala based code. On Sun, Apr 26, 2015 at 11:22 AM, Konstantin Boudnik c...@apache.org wrote: Hey Andrew. I believe the upgrade from 0.9 to 0.10 on our side should be simple enough. Unless you guys have changed the structure of the build, or the build system itself or something similarly drastic. Do you have any input on this? Thanks Cos P.S. Thanks for the slack channel - it might come in handy! On Fri, Apr 24, 2015 at 09:26PM, Andrew Musselman wrote: I'm not educated enough in what has to happen but we're happy to help. Are there things we need to do from the Mahout end or is it changing recipes and doing regressions of BigTop builds, etc., what else? On Friday, April 24, 2015, Konstantin Boudnik c...@apache.org wrote: I am trying to see if anyone is doing the accommodation of 0.10 into the coming 1.0 release. That's pretty much a release blocker at this point. I am not very much concerned about Spark compat, but if we're to take 0.10 into 1.0 it needs to work and be tested against 2.6.0 Hadoop. So, does anyone work on the patch or this JIRA? Cos On Fri, Apr 24, 2015 at 05:48PM, Andrew Musselman wrote: The spark 1.3 compat is in a near future release; what do you need from us to make 1.1 and 1.2 compat work? On Thursday, April 23, 2015, Konstantin Boudnik (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/BIGTOP-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510075#comment-14510075 ] Konstantin Boudnik commented on BIGTOP-1831: How is it going guys? Looks like this is one of the blockers for 1.0 as we can not use old 0.9 version. Appreciate the help! Thank you!
Upgrade Mahout to 0.10 -- Key: BIGTOP-1831 URL: https://issues.apache.org/jira/browse/BIGTOP-1831 Project: Bigtop Issue Type: Task Components: general Affects Versions: 0.8.0 Reporter: David Starina Priority: Blocker Labels: Mahout Fix For: 1.0.0 Need to upgrade Mahout to the latest 0.10 release (first Hadoop 2.x compatible release) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Streaming and incremental cooccurrence
Sounds about right. My guess is that memory is now large enough, especially on a cluster, that the cooccurrence will fit into memory quite often. Taking a large example of 10 million items and 10,000 cooccurrences each, there will be 100 billion cooccurrences to store which shouldn't take more than about half a TB of data if fully populated. This isn't that outrageous any more. With SSD's as backing store, even 100GB of RAM or less might well produce very nice results. Depending on incoming transaction rates, using spinning disk as a backing store might also work with small memory. Experiments are in order. On Fri, Apr 24, 2015 at 8:12 AM, Pat Ferrel p...@occamsmachete.com wrote: Ok, seems right. So now to data structures. The input frequency vectors need to be paired with each input interaction type and would be nice to have as something that can be copied very fast as they get updated. Random access would also be nice but iteration is not needed. Over time they will get larger as all items get interactions, users will get more actions and appear in more vectors (with multi-interaction data). Seems like hashmaps? The cooccurrence matrix is more of a question to me. It needs to be updatable at the row and column level, and random access for both row and column would be nice. It needs to be expandable. To keep it small the keys should be integers, not full blown ID strings. There will have to be one matrix per interaction type. It should be simple to update the Search Engine to either mirror the matrix or use it directly for index updates. Each indicator update should cause an index update. Putting aside speed and size issues this sounds like a NoSQL DB table that is cached in-memory. On Apr 23, 2015, at 3:04 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel p...@occamsmachete.com wrote: This seems to violate the random choice of interactions to cut but now that I think about it does a random choice really matter?
It hasn't ever mattered such that I could see. There is also some reason to claim that earliest is best if items are very focussed in time. Of course, the opposite argument also applies. That leaves us with empiricism where the results are not definitive. So I don't think that it matters, but I can't prove that it doesn't.
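The arithmetic behind the half-a-TB estimate above, with the per-entry size written out as an explicit assumption (roughly a packed integer key plus a small count):

```java
// Back-of-envelope sizing for a fully populated cooccurrence store.
public class SizingSketch {
  public static void main(String[] args) {
    long items = 10_000_000L;             // 10 million items
    long cooccurrencesPerItem = 10_000L;  // 10,000 each
    long entries = items * cooccurrencesPerItem;  // 100 billion entries
    long bytesPerEntry = 5;               // assumed: packed key + count
    double tb = entries * (double) bytesPerEntry / 1e12;
    System.out.printf("%d entries, ~%.1f TB fully populated%n", entries, tb);
  }
}
```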
Re: Streaming and incremental cooccurrence
On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel p...@occamsmachete.com wrote: This seems to violate the random choice of interactions to cut but now that I think about it does a random choice really matter? It hasn't ever mattered such that I could see. There is also some reason to claim that earliest is best if items are very focussed in time. Of course, the opposite argument also applies. That leaves us with empiricism where the results are not definitive. So I don't think that it matters, but I can't prove that it doesn't.
Re: Streaming and incremental cooccurrence
On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel p...@occamsmachete.com wrote: I think we have been talking about an idea that does an incremental approximation, then a refresh every so often to remove any approximation so in an ideal world we need both. Actually, the method I was pushing is exact. If the sampling is made deterministic using clever seeds, then deletion is even possible since you can determine whether an observation was thrown away rather than used to increment counts. The only creeping crud aspect of this is the accumulation of zero rows as things fall out of the accumulation window. I would be tempted to not allow deletion and just restart as Pat is suggesting.
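The deterministic sampling Ted describes can be sketched like this (names and hash mix are illustrative, not Mahout code): because keep/drop is a pure function of (user, item, seed), a later deletion can recompute whether the observation had been counted and decrement exactly.

```java
// Deterministic keep/drop decision: same inputs, same answer, forever.
public class DeterministicSampler {
  private final long seed;
  private final double keepProbability;

  public DeterministicSampler(long seed, double keepProbability) {
    this.seed = seed;
    this.keepProbability = keepProbability;
  }

  public boolean keep(String user, String item) {
    long h = seed;
    h = 31 * h + user.hashCode();
    h = 31 * h + item.hashCode();
    // finalizer borrowed from murmur-style mixing
    h ^= h >>> 33;
    h *= 0xff51afd7ed558ccdL;
    h ^= h >>> 33;
    double u = (h >>> 11) / (double) (1L << 53); // in [0, 1)
    return u < keepProbability;
  }

  public static void main(String[] args) {
    DeterministicSampler s = new DeterministicSampler(42L, 0.5);
    // the decision for a given interaction never changes
    System.out.println(s.keep("u1", "i1") == s.keep("u1", "i1"));
  }
}
```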
Re: Streaming and incremental cooccurrence
Inline On Sun, Apr 19, 2015 at 11:05 AM, Pat Ferrel p...@occamsmachete.com wrote: Short answer, you are correct this is not a new filter. The Hadoop MapReduce implements: * maxSimilaritiesPerItem * maxPrefs * minPrefsPerUser * threshold Scala version: * maxSimilaritiesPerItem I think of this as column-wise, but that may be bad terminology. * maxPrefs And I think of this as row-wise or user limit. I think it is the interaction-cut from the paper. The paper talks about an interaction-cut, and describes it with “There is no significant decrease in the error for incorporating more interactions from the ‘power users’ after that.” While I’d trust your reading better than mine I thought that meant downsampling overactive users. I agree. However both the Hadoop Mapreduce and the Scala version downsample both user and item interactions by maxPrefs. So you are correct, not a new thing. The paper also talks about the threshold and we’ve talked on the list about how better to implement that. A fixed number is not very useful so a number of sigmas was proposed but is not yet implemented. I think that both minPrefsPerUser and threshold have limited utility in the current code. Could be wrong about that. With low quality association measures that suffer from low count problems or simplistic user-based methods, minPrefsPerUser can be crucial. Threshold can also be required for systems like that. The Scala code doesn't have that problem since it doesn't support those metrics.
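The "number of sigmas" proposal mentioned above could look like the following sketch (illustrative only, not implemented in Mahout at the time of this thread): keep an indicator only if its LLR score clears the mean plus k standard deviations of the scores being compared.

```java
// Adaptive threshold: mean + sigmas * sd over a set of LLR scores.
public class SigmaThreshold {
  public static double threshold(double[] scores, double sigmas) {
    double mean = 0;
    for (double s : scores) mean += s;
    mean /= scores.length;
    double variance = 0;
    for (double s : scores) variance += (s - mean) * (s - mean);
    variance /= scores.length;
    return mean + sigmas * Math.sqrt(variance);
  }

  public static void main(String[] args) {
    double[] scores = {0.1, 0.2, 0.1, 12.0}; // one strong indicator
    System.out.println("keep scores above " + threshold(scores, 1.0));
  }
}
```

Unlike a fixed number, this adapts to each indicator type's score distribution, which is the point of the proposal.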
Re: Streaming and incremental cooccurrence
Andrew Take a look at the slides I posted. In them I showed that the update does not grow beyond a very reasonable bound. Sent from my iPhone On Apr 18, 2015, at 9:15, Andrew Musselman andrew.mussel...@gmail.com wrote: Yes that's what I mean; if the number of updates gets too big it probably would be unmanageable though. This approach worked well with daily updates, but never tried it with anything real time. On Saturday, April 18, 2015, Pat Ferrel p...@occamsmachete.com wrote: I think you are saying that instead of val newHashMap = lastHashMap ++ updateHashMap, layered updates might be useful since new and last are potentially large. Some limit of updates might trigger a refresh. This might work if the update works with incremental index updates in the search engine. Given practical considerations the updates will be numerous and nearly empty. On Apr 17, 2015, at 7:58 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: I have not implemented it for recommendations but a layered cache/sieve structure could be useful. That is, between batch refreshes you can keep tacking on new updates in a cascading order so values that are updated exist in the newest layer but otherwise the lookup goes for the latest updated layer. You can put a fractional multiplier on older layers for aging but again I've not implemented it. On Friday, April 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: Yes. Also add the fact that the nano batches are bounded tightly in size both max and mean. And mostly filtered away anyway. Aging is an open question. I have never seen any effect of alternative sampling so I would just assume keep oldest which just tosses more samples. Then occasionally rebuild from batch if you really want aging to go right. Search updates any more are true realtime also so that works very well. Sent from my iPhone On Apr 17, 2015, at 17:20, Pat Ferrel p...@occamsmachete.com wrote: Thanks.
This idea is based on a micro-batch of interactions per update, not individual ones unless I missed something. That matches the typical input flow. Most interactions are filtered away by frequency and number of interaction cuts. A couple of practical issues: In practice won’t this require aging of interactions too? So wouldn’t the update require some old interaction removal? I suppose this might just take the form of added null interactions representing the geriatric ones? Haven’t gone through the math with enough detail to see if you’ve already accounted for this. To use actual math (self-join, etc.) we still need to alter the geometry of the interactions to have the same row rank as the adjusted total. In other words the number of rows in all resulting interactions must be the same. Over time this means completely removing rows and columns or allowing empty rows in potentially all input matrices. Might not be too bad to accumulate gaps in rows and columns. Not sure if it would have a practical impact (to some large limit) as long as it was done, to keep the real size more or less fixed. As to realtime, that would be under search engine control through incremental indexing and there are a couple ways to do that, not a problem afaik. As you point out the query always works and is real time. The index update must be frequent and not impact the engine's availability for queries. On Apr 17, 2015, at 2:46 PM, Ted Dunning ted.dunn...@gmail.com wrote: When I think of real-time adaptation of indicators, I think of this: http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel p...@occamsmachete.com wrote: I’ve been thinking about Streaming (continuous input) and incremental cooccurrence.
As interactions stream in from the user it is fairly simple to use something like Spark streaming to maintain a moving time window for all input, and an update frequency that recalcs all input currently in the time window. I’ve done this with the current cooccurrence code but though streaming, this is not incremental. The current data flow goes from interaction input to geometry and user dictionary reconciliation to A’A, A’B etc. After the multiply the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled. Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
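Andrew's layered cache/sieve idea from earlier in the thread can be sketched like this (an illustrative data structure, not anything in Mahout): lookups walk layers newest-first, and a batch refresh collapses all layers back into one.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Layered map: each micro-batch of updates goes into a fresh layer;
// reads see the newest value without rewriting older layers.
public class LayeredMap<K, V> {
  private final Deque<Map<K, V>> layers = new ArrayDeque<>();

  public LayeredMap() { layers.push(new HashMap<>()); }

  // start a new layer for the next micro-batch of updates
  public void newLayer() { layers.push(new HashMap<>()); }

  public void put(K key, V value) { layers.peek().put(key, value); }

  // newest layer wins
  public V get(K key) {
    for (Map<K, V> layer : layers) {
      V v = layer.get(key);
      if (v != null) return v;
    }
    return null;
  }

  // batch refresh: merge everything into a single base layer
  public void compact() {
    Map<K, V> merged = new HashMap<>();
    Iterator<Map<K, V>> it = layers.descendingIterator(); // oldest first
    while (it.hasNext()) merged.putAll(it.next());
    layers.clear();
    layers.push(merged);
  }

  public static void main(String[] args) {
    LayeredMap<String, Integer> m = new LayeredMap<>();
    m.put("a", 1);
    m.newLayer();
    m.put("a", 2);
    System.out.println(m.get("a")); // newest layer wins
  }
}
```

The aging multiplier Andrew mentions is not shown; it would go in get() as a per-layer discount.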
Re: Structure-based a %*% b optimization results.
Sadly, no, since that was from a different job. But here are some references with snippets: This one indicates that things have changed dramatically even just from 2009: http://www.cs.cornell.edu/~bindel/class/cs6210-f12/notes/lec02.pdf This next is a web aside from a pretty good looking book [1] http://csapp.cs.cmu.edu/2e/waside/waside-blocking.pdf I would guess that Samsara's optimizer could well do blocking as well as the transpose transformations that Dmitriy is talking about. [1] http://csapp.cs.cmu.edu/ On Fri, Apr 17, 2015 at 10:24 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Ted you have any sample code snippets? On Friday, April 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: This does look good. One additional thought would be to do a standard multi-level blocking implementation of matrix times. In my experience this often makes orientation much less important. The basic reason is that dense times requires n^3 ops but only n^2 memory operations. By rearranging the loops you get reuse in registers and then reuse in L1 and L2. The win that you are getting now is due to cache lines being fully used rather than partially used and then lost before they are touched again. The last time I did this, there were only three important caching layers. Registers. Cache. Memory. There might be more now. Done well, this used to buy 10x speed. Might even buy more, especially with matrices that blow L2 or even L3. Sent from my iPhone On Apr 17, 2015, at 17:26, Dmitriy Lyubimov dlie...@gmail.com javascript:; wrote: Spent an hour on this today. What i am doing: simply reimplementing pairwise dot-product algorithm in stock dense matrix times(). However, equipping every matrix with structure flavor (i.e. dense(...) reports row-wise , and dense(...).t reports column wise, dense().t.t reports row-wise again, etc.) 
Next, I wrote a binary operator that switches on the combination of operand orientations and flips the misaligned operand(s) (if any) to match the speediest orientation, RW-CW. Here are results for 300x300 dense matrix pairs:
Ad %*% Bd: (107.125, 46.375)
Ad' %*% Bd: (206.475, 39.325)
Ad %*% Bd': (37.2, 42.65)
Ad' %*% Bd': (100.95, 38.025)
Ad'' %*% Bd'': (120.125, 43.3)
These results are for transpose combinations of the original 300x300 dense random matrices, averaged over 40 runs (so standard error should be well controlled), in ms. The first number is the stock times() application (i.e. what we'd do with the %*% operator now), and the second number is the time with the matrices rewritten into RW-CW orientation. For example, AB reorients B only, just like A''B''; AB' reorients nothing; and the worst case, A'B, reorients both (I also tried to run a sum of outer products for the A'B case without reorientation -- apparently L1 misses far outweigh the cost of reorientation; I got very bad results for the outer-product sum). As we can see, the stock times() version does pretty badly even for dense operands, for any orientation except the optimal one. Given that, I am inclined to just add orientation-driven structure optimization here and replace all stock calls with just the orientation adjustment. Of course I will need to extend this to the sparse and sparse-row matrix combinations (quite a few of those, I guess) and see what happens compared to the stock sparse multiplications. But even this seems like a big win to me (basically, just doing the reorientation optimization seems to give a 3x speedup on average in matrix-matrix multiplication in 3 cases out of 4, and a tie in 1 case).
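As a sketch of why the RW-CW case wins (an illustration under assumed storage layouts, not the actual patch): flipping B into column-major storage up front turns every c[i][j] into a dot product over two contiguous arrays, so cache lines are fully consumed instead of partially used and evicted:

```java
/** Illustrative sketch: multiply an m x n by an n x p dense matrix,
 *  re-orienting B first so the inner dot product walks both operands
 *  along contiguous rows (the RW-CW case in the timings above). */
public class OrientedTimes {
    static double[][] times(double[][] a, double[][] b) {
        int m = a.length, n = b.length, p = b[0].length;
        // Store B^T row-wise (i.e. B column-wise) so each c[i][j]
        // is a dot product of two contiguous arrays.
        double[][] bt = new double[p][n];
        for (int k = 0; k < n; k++)
            for (int j = 0; j < p; j++)
                bt[j][k] = b[k][j];
        double[][] c = new double[m][p];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < p; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++) s += a[i][k] * bt[j][k];
                c[i][j] = s;
            }
        return c;
    }
}
```

The O(n^2) reorientation pass is amortized against the O(n^3) multiply, which matches the observation above that L1 misses far outweigh the cost of reorienting.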
Re: Streaming and incremental cooccurrence
On Sat, Apr 18, 2015 at 11:29 AM, Pat Ferrel p...@occamsmachete.com wrote: You seem to be proposing a new cut by frequency of item interaction; is this correct? This is because the frequency is known before the multiply and LLR. I assume the #2 cut is left in place? Yes, but I didn't think it was new.
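For reference, the "multiply and LLR" step mentioned above reduces to the G² log-likelihood ratio on a 2x2 contingency table of counts. The sketch below is a hedged re-derivation in the spirit of Mahout's LogLikelihood helper, not a copy of the project's code:

```java
/** Sketch of the G^2 log-likelihood ratio on a 2x2 contingency table:
 *  k11 = cooccurrence count, k12/k21 = one-sided counts, k22 = the rest. */
public class Llr {
    static double xLogX(long x) { return x == 0 ? 0.0 : x * Math.log(x); }

    /** Unnormalized entropy of a set of counts: N * H(counts / N). */
    static double entropy(long... counts) {
        long sum = 0;
        double s = 0.0;
        for (long c : counts) { sum += c; s += xLogX(c); }
        return xLogX(sum) - s;
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }
}
```

The frequency cut being discussed is cheap precisely because it needs only the marginal counts (k11 + k12, etc.), which are known before the multiply, while the LLR weight needs k11 from the multiply itself.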
Re: Additional Travis-CI Capacity
It is a piece of cake for simple builds. It required setting up a config file that is seen by Travis CI on the GitHub repo. If you use a Maven build, this is dead simple. Here, for instance, is the entire config for t-digest from the .travis.yml file:
language: java
jdk:
  - oraclejdk7
  - openjdk7
I had to tell Travis to look at the project, but that was it. Much simpler than, say, Jenkins. Bound to be less flexible as well, but if it does what I want and is more reliable because of fewer corner cases, how bad can it be to lose flexibility that I wouldn't use? On Fri, Apr 17, 2015 at 3:28 AM, Andrew Musselman a...@apache.org wrote: We're asking ourselves the same thing on dev@mahout. On Thursday, April 16, 2015, Konstantin Boudnik c...@apache.org wrote: How much work is it to re-implement everything in the new platform? Anyone have any experience with it? Cos On Thu, Apr 16, 2015 at 05:20PM, Roman Shaposhnik wrote: Is this something that we may want to look at? Thanks, Roman. -- Forwarded message -- From: David Nalley da...@gnsa.us Date: Wed, Apr 15, 2015 at 3:33 PM Subject: Additional Travis-CI Capacity To: bui...@apache.org FYI: https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
Re: Streaming and incremental cooccurrence
When I think of real-time adaptation of indicators, I think of this: http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel p...@occamsmachete.com wrote: I’ve been thinking about streaming (continuous input) and incremental cooccurrence. As interactions stream in from the user it is fairly simple to use something like Spark Streaming to maintain a moving time window over all input, and an update frequency that recalculates all input currently in the time window. I’ve done this with the current cooccurrence code, but though streaming, it is not incremental. The current data flow goes from interaction input, to geometry and user-dictionary reconciliation, to A’A, A’B, etc. After the multiply, the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled. Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
Re: [VOTE] Add Travis-CI for Mahout
On Fri, Apr 17, 2015 at 6:32 PM, Pat Ferrel p...@occamsmachete.com wrote: Doesn’t Apache have some draconian requirement to control all bits of the project pipeline and workflow? No. Apache has a strict policy about *hosting* all of the bits that users of the software consume. That means the authoritative version history, and the released bits. Using outside tools, either automated or manual, is a fine thing.
Re: Structure-based a %*% b optimization results.
This does look good. One additional thought would be to do a standard multi-level blocking implementation of matrix times. In my experience this often makes orientation much less important. The basic reason is that dense times requires n^3 ops but only n^2 memory operations. By rearranging the loops you get reuse in registers and then reuse in L1 and L2. The win that you are getting now is due to cache lines being fully used, rather than partially used and then lost before they are touched again. The last time I did this, there were only three important caching layers: registers, cache, memory. There might be more now. Done well, this used to buy 10x speed. Might even buy more, especially with matrices that blow L2 or even L3. Sent from my iPhone On Apr 17, 2015, at 17:26, Dmitriy Lyubimov dlie...@gmail.com wrote: Spent an hour on this today. What I am doing: simply reimplementing the pairwise dot-product algorithm in the stock dense matrix times(), however equipping every matrix with a structure flavor (i.e. dense(...) reports row-wise, dense(...).t reports column-wise, dense().t.t reports row-wise again, etc.). Next, I wrote a binary operator that switches on the combination of operand orientations and flips the misaligned operand(s) (if any) to match the speediest orientation, RW-CW. Here are results for 300x300 dense matrix pairs:
Ad %*% Bd: (107.125, 46.375)
Ad' %*% Bd: (206.475, 39.325)
Ad %*% Bd': (37.2, 42.65)
Ad' %*% Bd': (100.95, 38.025)
Ad'' %*% Bd'': (120.125, 43.3)
These results are for transpose combinations of the original 300x300 dense random matrices, averaged over 40 runs (so standard error should be well controlled), in ms. The first number is the stock times() application (i.e. what we'd do with the %*% operator now), and the second number is the time with the matrices rewritten into RW-CW orientation.
For example, AB reorients B only, just like A''B''; AB' reorients nothing; and the worst case, A'B, reorients both (I also tried to run a sum of outer products for the A'B case without reorientation -- apparently L1 misses far outweigh the cost of reorientation; I got very bad results for the outer-product sum). As we can see, the stock times() version does pretty badly even for dense operands, for any orientation except the optimal one. Given that, I am inclined to just add orientation-driven structure optimization here and replace all stock calls with just the orientation adjustment. Of course I will need to extend this to the sparse and sparse-row matrix combinations (quite a few of those, I guess) and see what happens compared to the stock sparse multiplications. But even this seems like a big win to me (basically, just doing the reorientation optimization seems to give a 3x speedup on average in matrix-matrix multiplication in 3 cases out of 4, and a tie in 1 case).
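Ted's multi-level blocking suggestion can be sketched as a tiled triple loop. The block size and loop order here are illustrative assumptions to be tuned per cache level, not the proposed Mahout change:

```java
/** Hedged sketch of cache blocking (tiling) for dense matrix multiply:
 *  process BLOCK x BLOCK tiles so the working set stays cache-resident. */
public class BlockedTimes {
    static final int BLOCK = 32; // illustrative tile edge; tune to L1/L2 size

    static double[][] times(double[][] a, double[][] b) {
        int m = a.length, n = b.length, p = b[0].length;
        double[][] c = new double[m][p];
        for (int i0 = 0; i0 < m; i0 += BLOCK)
            for (int k0 = 0; k0 < n; k0 += BLOCK)
                for (int j0 = 0; j0 < p; j0 += BLOCK)
                    // Multiply one pair of tiles. The i-k-j order keeps
                    // a[i][k] in a register and streams c[i] and b[k] row-wise.
                    for (int i = i0; i < Math.min(i0 + BLOCK, m); i++)
                        for (int k = k0; k < Math.min(k0 + BLOCK, n); k++) {
                            double aik = a[i][k];
                            for (int j = j0; j < Math.min(j0 + BLOCK, p); j++)
                                c[i][j] += aik * b[k][j];
                        }
        return c;
    }
}
```

With the tiles sized so that three BLOCK x BLOCK panels fit in a cache level, each loaded cache line is reused ~BLOCK times before eviction, which is what makes orientation matter less.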
Re: [VOTE] Add Travis-CI for Mahout
I use it for t-digest and like it a lot. There are some strict bounds on how much resource you are supposed to consume. Mileage may vary. On Fri, Apr 17, 2015 at 12:23 AM, Suneel Marthi suneel.mar...@gmail.com wrote: Would this be an additional CI we would like to add to Mahout ? https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci I am up for it. +1
Re: Next version
A word of warning about making decisions off-list and without a permanent record on the mailing list. I will likely be available, but may not be. I am happy with whatever the consensus is (with a tilt towards frequent releases), but would like to see most of the decision process on the list. On Tue, Apr 14, 2015 at 4:44 AM, Suneel Marthi suneel.mar...@gmail.com wrote: We should talk about this. Could the team Slack tomorrow at 1 PM Eastern Time to talk this out and also finalize scope for the next one? On Mon, Apr 13, 2015 at 9:14 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I thought we wanted to do 0.10.1 with a quicker release cycle and bugfixes? On Sun, Apr 12, 2015 at 6:47 AM, Suneel Marthi suneel.mar...@gmail.com wrote: On Sun, Apr 12, 2015 at 8:56 AM, Stevo Slavić ssla...@gmail.com wrote: Hello team, Should the next version be 0.10.1 or 0.11.0? I am fine with just 0.11 Thinking maybe 0.11.0 is more suitable, if it's going to contain artifact name changes like MAHOUT-1680 and MAHOUT-1681, and fundamental new features, so we keep minor releases for backward-compatible bug-fix releases only. Btw, it would be good (whoever has privileges) to have the versions in the JIRA project sorted out: - mark 0.10.0 as released - remove the two empty 1.0-snapshot versions - move 1.0 to the top and clear its release date - move 0.10.1/0.11.0 under 1.0 and after 0.10.0 Stevo, you should have permissions now to fix all of the above. - maybe plan and set a 0.10.1/0.11.0 expected release date (Suneel was mentioning it would be nice to integrate with Apache Flink by October, in time for http://lanyrd.com/2015/flink-forward/ ) This would definitely be a good story to present at http://lanyrd.com/2015/flink-forward/ The Flink team is ready to dedicate resources from their camp to work with us. Kind regards, Stevo Slavic.
Re: Next version
On Tue, Apr 14, 2015 at 8:49 AM, Stevo Slavić ssla...@gmail.com wrote: I'm not sure, but I doubt there's anything in the Apache way of doing things that's preventing us from having both 0.10.1 and 0.11.0 releases planned and worked on in parallel with dedicated branches, e.g. master for the next major.minor/non-bug-fix release, and branches for bug-fix-supported versions like 0.10 or 0.10.x. One can create a 0.10.x branch from the 0.10.0 release tag. Changes there have to be regularly merged to master. This is entirely up to the project from the Apache viewpoint. (And speaking as a project member, it sounds like a good idea.)
Re: [VOTE] Apache Mahout 0.10.0 Release
Quick reminder for the next release: It is important that at least one set of eyes examine the licensing aspects of the release. This includes running RAT, making sure that bits and bobs are named accurately and that the NOTICE and LICENSE files are correct. We should have different people check different things next time. On Sat, Apr 11, 2015 at 11:25 AM, Suneel Marthi suneel.mar...@gmail.com wrote: Thanks everyone. We have had 5 +1 votes from the PMC, so this release has passed and the voting officially closes. Will send a formal release announcement once the release is finalized. Thanks again. On Sat, Apr 11, 2015 at 12:20 PM, Pat Ferrel p...@occamsmachete.com wrote: Just built an external app using sbt against the staging repo and it looks good to me +1 (binding) On Apr 11, 2015, at 9:12 AM, Andrew Palumbo ap@outlook.com wrote: After testing examples locally from the .tar and .zip distributions and testing the staged mahout-math artifact in a Java application, I am happy with this release. +1 (binding) On 04/11/2015 11:45 AM, Suneel Marthi wrote: After checking the {source} * {tar,zip} and running a few tests locally, I am fine with this release. +1 (binding) On Sat, Apr 11, 2015 at 11:43 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: After checking the binary tarball and zip, and running through all the examples on an EMR cluster, I am good with this release. +1 (binding) On Fri, Apr 10, 2015 at 9:34 PM, Ted Dunning ted.dunn...@gmail.com wrote: Ah... forgot this. +1 (binding) On Fri, Apr 10, 2015 at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: I downloaded and tested the signatures and check-sums on {binary,source} x {zip,tar} + pom. All were correct. One thing that I worry a little about is that the name of the artifact doesn't include apache. Not sure that is a hard requirement, but it seems a good thing to do.
On Fri, Apr 10, 2015 at 8:16 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Here's a new Mahout 0.10.0 Release Candidate at https://repository.apache.org/content/repositories/orgapachemahout-1007/ The voting for this ends tomorrow. Need at least 3 PMC +1s for the release to pass. Grant, Ted: Would appreciate it if you guys could verify the signatures. Rest: Please test the artifacts. Thanks to all the contributors and committers. Regards, Suneel On Fri, Apr 10, 2015 at 11:45 AM, Pat Ferrel p...@occamsmachete.com wrote: Ran well, but we have a packaging problem with the binary distro. Will require either a pom or code change, I think; hold the vote. On Apr 9, 2015, at 4:31 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Running on EMR now. On Thu, Apr 9, 2015 at 3:52 PM, Pat Ferrel p...@occamsmachete.com wrote: I can't run it (due to a messed-up dev machine), but I verified the artifacts by building an external app with sbt using the staged repo instead of my local .m2 cache. This means the Scala classes were resolved correctly from the artifacts. Hope someone can actually run it on a cluster. On Apr 9, 2015, at 2:42 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Please find the Mahout 0.10.0 release candidate at https://repository.apache.org/content/repositories/orgapachemahout-1005/ The voting runs till Saturday, April 11 2015; need at least 3 PMC +1 votes for the candidate release to pass. Thanks again to all the committers and contributors for their hard work over the past few weeks. Regards, Suneel On Behalf of Apache Mahout Team
Re: [VOTE] Apache Mahout 0.10.0 Release
I downloaded and tested the signatures and check-sums on {binary,source} x {zip,tar} + pom. All were correct. One thing that I worry a little about is that the name of the artifact doesn't include apache. Not sure that is a hard requirement, but it seems a good thing to do. On Fri, Apr 10, 2015 at 8:16 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Here's a new Mahout 0.10.0 Release Candidate at https://repository.apache.org/content/repositories/orgapachemahout-1007/ The voting for this ends tomorrow. Need at least 3 PMC +1s for the release to pass. Grant, Ted: Would appreciate it if you guys could verify the signatures. Rest: Please test the artifacts. Thanks to all the contributors and committers. Regards, Suneel On Fri, Apr 10, 2015 at 11:45 AM, Pat Ferrel p...@occamsmachete.com wrote: Ran well, but we have a packaging problem with the binary distro. Will require either a pom or code change, I think; hold the vote. On Apr 9, 2015, at 4:31 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Running on EMR now. On Thu, Apr 9, 2015 at 3:52 PM, Pat Ferrel p...@occamsmachete.com wrote: I can't run it (due to a messed-up dev machine), but I verified the artifacts by building an external app with sbt using the staged repo instead of my local .m2 cache. This means the Scala classes were resolved correctly from the artifacts. Hope someone can actually run it on a cluster. On Apr 9, 2015, at 2:42 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Please find the Mahout 0.10.0 release candidate at https://repository.apache.org/content/repositories/orgapachemahout-1005/ The voting runs till Saturday, April 11 2015; need at least 3 PMC +1 votes for the candidate release to pass. Thanks again to all the committers and contributors for their hard work over the past few weeks. Regards, Suneel On Behalf of Apache Mahout Team
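The checksum half of the verification described above amounts to recomputing the digest over the downloaded bytes and comparing it to the published one. A minimal sketch, with hypothetical class and method names (the real check is usually done with command-line tools against the actual staging-repo artifacts):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: verify a release artifact's published SHA-512 digest. */
public class ChecksumCheck {
    /** Hex-encode the SHA-512 digest of the given bytes. */
    static String sha512Hex(byte[] bytes) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(bytes)) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    /** True when the recomputed digest matches the published one. */
    static boolean verify(byte[] artifactBytes, String publishedHex) throws NoSuchAlgorithmException {
        return sha512Hex(artifactBytes).equalsIgnoreCase(publishedHex.trim());
    }
}
```

The signature check is separate: it needs the signer's public key and a tool such as GPG, and cannot be reduced to a digest comparison.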
Re: [VOTE] Apache Mahout 0.10.0 Release
Ah... forgot this. +1 (binding) On Fri, Apr 10, 2015 at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: I downloaded and tested the signatures and check-sums on {binary,source} x {zip,tar} + pom. All were correct. One thing that I worry a little about is that the name of the artifact doesn't include apache. Not sure that is a hard requirement, but it seems a good thing to do. On Fri, Apr 10, 2015 at 8:16 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Here's a new Mahout 0.10.0 Release Candidate at https://repository.apache.org/content/repositories/orgapachemahout-1007/ The voting for this ends tomorrow. Need at least 3 PMC +1s for the release to pass. Grant, Ted: Would appreciate it if you guys could verify the signatures. Rest: Please test the artifacts. Thanks to all the contributors and committers. Regards, Suneel On Fri, Apr 10, 2015 at 11:45 AM, Pat Ferrel p...@occamsmachete.com wrote: Ran well, but we have a packaging problem with the binary distro. Will require either a pom or code change, I think; hold the vote. On Apr 9, 2015, at 4:31 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Running on EMR now. On Thu, Apr 9, 2015 at 3:52 PM, Pat Ferrel p...@occamsmachete.com wrote: I can't run it (due to a messed-up dev machine), but I verified the artifacts by building an external app with sbt using the staged repo instead of my local .m2 cache. This means the Scala classes were resolved correctly from the artifacts. Hope someone can actually run it on a cluster. On Apr 9, 2015, at 2:42 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Please find the Mahout 0.10.0 release candidate at https://repository.apache.org/content/repositories/orgapachemahout-1005/ The voting runs till Saturday, April 11 2015; need at least 3 PMC +1 votes for the candidate release to pass. Thanks again to all the committers and contributors for their hard work over the past few weeks. Regards, Suneel On Behalf of Apache Mahout Team
Re: Professional services
Actually, I should change my line to: MapR Technologies | sa...@maprtech.com | Full commercial support On Fri, Apr 3, 2015 at 4:58 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Anyone else want their contact info on this page? Frank, what URL would you like to use; that one 404s.
[jira] [Reopened] (MAHOUT-1672) Update OnlineSummarizer to use the new T-Digest
[ https://issues.apache.org/jira/browse/MAHOUT-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reopened MAHOUT-1672: - If I get a 3.1 release out before Sunday, I would like to use that. No code changes will be required, just the pom. Update OnlineSummarizer to use the new T-Digest Key: MAHOUT-1672 URL: https://issues.apache.org/jira/browse/MAHOUT-1672 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.9 Reporter: Suneel Marthi Assignee: Suneel Marthi Priority: Trivial Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1668) Automate release process
[ https://issues.apache.org/jira/browse/MAHOUT-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393127#comment-14393127 ] Ted Dunning commented on MAHOUT-1668: - Signing cannot be done on shared hardware that you don't control. That still leaves a vat of stuff that can be done by the automated system, but you need some way for the release manager to verify that the bits in the release are exactly what is expected. Automate release process Key: MAHOUT-1668 URL: https://issues.apache.org/jira/browse/MAHOUT-1668 Project: Mahout Issue Type: Task Reporter: Stevo Slavic Assignee: Stevo Slavic Fix For: 0.10.0 -- 0.10.0 will be the first release since the project switched to git. Some changes have to be made in the build scripts to support the release process, the Apache way. As a consequence, the how-to-make-a-release docs will likely need to be updated as well. Also, it would be nice to automate the release process as much as possible, e.g. via dedicated Jenkins build job(s), so it's easy for any committer to cut a release for vote, and after the vote either finalize the release or easily make a new RC - this will enable us to release faster and more often. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: MapR repo might need to be updated
(moving dev@mahout to bcc since this is not of widespread interest) Stevo, Here is what our builds guy says: Our version of nexus is 2.3.1. The last update to the repo was Friday. Because the error listed a cookie issue, I restarted apache. I have two builds building right now and pulling from the repo, no issues, yet. Can you say if the problem persists? On Mon, Mar 30, 2015 at 2:34 PM, Stevo Slavić ssla...@gmail.com wrote: Hello Ted, MapR Maven repository manager, seems to be Nexus, and it seems to be version 2.11.1 or older with this bug still in it: https://issues.sonatype.org/browse/NEXUS-7877 Mahout build uses MapR Maven repository, and for all artifacts/dependencies resolved from it, build output is polluted with warnings like: Downloading: http://repository.mapr.com/maven/org/apache/apache/16/apache-16.pom Mar 30, 2015 11:20:48 PM org.apache.maven.wagon.providers.http.httpclient.client.protocol.ResponseProcessCookies processCookies WARNING: Cookie rejected [rememberMe=deleteMe, version:0, domain: repository.mapr.com, path:/nexus, expiry:Mon Mar 30 23:20:48 CEST 2015] Illegal path attribute /nexus. Path of origin: /maven/org/apache/apache/16/apache-16.pom Please consider having it updated. Kind regards, Stevo Slavic.
Re: Anyone using eclipse?
Idea here as well. On Mon, Mar 30, 2015 at 4:52 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Idea here On Mon, Mar 30, 2015 at 4:42 PM, Andrew Palumbo ap@outlook.com wrote: also using idea On 03/30/2015 07:18 PM, Dmitriy Lyubimov wrote: I switched to IDEA since I started doing mixed projects with Scala. Standalone Scala is bearable in Eclipse, but mixed projects simply don't work (and Mahout is likely one of them). On Mon, Mar 30, 2015 at 3:58 PM, Suneel Marthi suneel.mar...@gmail.com wrote: I believe it's only Shannon from amongst the committer team who is using Eclipse. I am trying to talk him into shifting to IntelliJ. On Mon, Mar 30, 2015 at 6:54 PM, Stevo Slavić ssla...@gmail.com wrote: Hello team, I'm curious, are any of you using the Eclipse IDE? If not, then as part of MAHOUT-1278 I could remove a lot from our POMs. Kind regards, Stevo Slavic.
Re: MapR repo might need to be updated
Thanks. On it. On Mon, Mar 30, 2015 at 2:34 PM, Stevo Slavić ssla...@gmail.com wrote: Hello Ted, MapR Maven repository manager, seems to be Nexus, and it seems to be version 2.11.1 or older with this bug still in it: https://issues.sonatype.org/browse/NEXUS-7877 Mahout build uses MapR Maven repository, and for all artifacts/dependencies resolved from it, build output is polluted with warnings like: Downloading: http://repository.mapr.com/maven/org/apache/apache/16/apache-16.pom Mar 30, 2015 11:20:48 PM org.apache.maven.wagon.providers.http.httpclient.client.protocol.ResponseProcessCookies processCookies WARNING: Cookie rejected [rememberMe=deleteMe, version:0, domain: repository.mapr.com, path:/nexus, expiry:Mon Mar 30 23:20:48 CEST 2015] Illegal path attribute /nexus. Path of origin: /maven/org/apache/apache/16/apache-16.pom Please consider having it updated. Kind regards, Stevo Slavic.
Re: Require Java 7 and Hadoop 2.x?
There are subtle API incompatibilities. Unfortunate. But true. On Fri, Mar 27, 2015 at 10:16 AM, Pat Ferrel p...@occamsmachete.com wrote: As I said in the other thread, forcing Java 7 is not as big a deal as forcing Hadoop 1.2.1. Is there some new part of 2.X that we need? Or some forced API incompatibility? On Mar 27, 2015, at 9:58 AM, Suneel Marthi suneel.mar...@gmail.com wrote: TED??? please jump in. On Fri, Mar 27, 2015 at 12:54 PM, Pat Ferrel p...@occamsmachete.com wrote: Aren’t current Mahout 0.9 users on Hadoop 1.2.1 by definition? Probably most on Java 6 too. Unless there is some strong reason, it seems like we should support both of those for at least one release, shouldn’t we? I have a Hadoop 1.2.1 cluster, which has a Hadoop job that is not Hadoop 2 compatible, so I’m stuck there for the time being. Compiling Mahout for this now gives an error over the H2 API “isDirectory”, which I think used to be “isDir” for H1. Has that API been deprecated in H2? Are we forced to choose either/or? On Mar 27, 2015, at 9:31 AM, Pat Ferrel p...@occamsmachete.com wrote: It should; Hadoop supports it long term and lots of people are stuck there with projects that haven’t been upgraded (Mahout comes to mind). On Mar 27, 2015, at 9:26 AM, Stevo Slavić ssla...@gmail.com wrote: Have to check, but I doubt that the build supports hadoop 1.x any more. On Fri, Mar 27, 2015 at 5:15 PM, Suneel Marthi suneel.mar...@gmail.com wrote: This is the Java version, gotta use Java 7 On Fri, Mar 27, 2015 at 12:08 PM, Pat Ferrel p...@occamsmachete.com wrote: Latest source for Spark 1.1.0 and Hadoop 1.2.1. Build complains about the move to <maven.compiler.target>1.7</maven.compiler.target> I think this was upped from 1.6, but I'm not sure if that’s what the error is about. I’m on Java 6 on this machine, if that matters. Actual error: [INFO] Mahout Build Tools SUCCESS [3.512s] [INFO] Apache Mahout . SUCCESS [0.603s] [INFO] Mahout Math ... FAILURE [6.453s] [INFO] Mahout MapReduce Legacy ...
SKIPPED
[INFO] Mahout Integration ... SKIPPED
[INFO] Mahout Examples ... SKIPPED
[INFO] Mahout Release Package ... SKIPPED
[INFO] Mahout Math Scala bindings ... SKIPPED
[INFO] Mahout Spark bindings ... SKIPPED
[INFO] Mahout Spark bindings shell ... SKIPPED
[INFO] Mahout H2O backend ... SKIPPED
[INFO] BUILD FAILURE
[INFO] Total time: 11.609s
[INFO] Finished at: Fri Mar 27 08:55:35 PDT 2015
[INFO] Final Memory: 24M/310M
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:compile (default-compile) on project mahout-math: Fatal error compiling: invalid target release: 1.7 - [Help 1]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
Re: Release
That is great news. Of course, Anand is doing that personally and doesn't actually work for h2o.ai (formerly 0xdata). It was the company that I meant. Apache contributors are individuals, of course, but having somebody be paid for building contributions definitely helps with avoiding distractions like finding groceries. On Tue, Mar 17, 2015 at 6:35 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Who the heck said they have moved on? Anand confirmed just today that he would continue on the H2O Mahout integration. Sent from my iPhone On Mar 17, 2015, at 8:26 PM, todd rtmel...@gmail.com wrote: On 03/17/2015 12:49 PM, Ted Dunning wrote: I think it should be deprecated. The H2O guys have moved on after the reception they got. They've moved on? This has got to be one of the most disappointing things I have read in a long time.
Re: Release
On Tue, Mar 17, 2015 at 10:14 AM, Pat Ferrel p...@occamsmachete.com wrote: I’m nervous releasing H2O with no one supporting it. Is anyone signing up for that? I think it should be deprecated. The H2O guys have moved on after the reception they got.
Re: Mahout listed under Lucene category in Jira
This looks fixed. On Tue, Mar 17, 2015 at 10:36 AM, Dyer, James james.d...@ingramcontent.com wrote: Someone on the Lucene PMC noticed that Mahout JIRAs appear in our list at reporter.apache.org. We think this might be because Mahout is still listed under the Lucene category in Jira. ( https://issues.apache.org/jira/secure/BrowseProjects.jspa#10150). Is there an admin who can change the Mahout project's category in Jira? Thank you! James Dyer Ingram Content Group
Re: Neural network contribution
Burak, Sounds like a nice effort. Mahout is focused on implementations in Java and lately Scala, not C. There is another project, however, just entering incubation that might fit much better: the Singa project. The proposal is here: http://wiki.apache.org/incubator/SingaProposal I suggest that you contact Beng Chin Ooi (email on the proposal) to discuss what you have in more detail. The group that started the Singa project is very good on neural networks and should be able to comment better than we can. On Tue, Mar 10, 2015 at 4:55 AM, burak sarac bu...@linux.com wrote: Hello all, A few months ago I completed a small neural network application for study. I just met Mahout and I liked it! I was also looking for a neural network implementation to compare against mine and couldn't find any. If I am not wrong, is there any chance I can contribute my project? With the Andrew Ng samples, 5000 digits calculate in 400 ms on a single core, 40 ms on my 8-core machine, and 20 ms on GPU. (Each iteration mostly does 1 calculation.) I have used an Fmincg implementation in C for optimization. At least you could maybe do a code review for me? I will appreciate any feedback! Implementation is in C. There are also more features which I didn't commit yet (improved feature scaling, using different hidden-layer sizes per layer, etc.). Project here: https://github.com/buraksarac/NeuralNetwork Main logic here: https://github.com/buraksarac/NeuralNetwork/blob/master/src/NeuralNetwork.cpp Thank you for your time! p.s. I tried to send a few emails; I hope I didn't flood. Burak Sarac
Re: What is Mahout?
+1 for keeping the name -1 for incubation On Thu, Feb 26, 2015 at 5:24 AM, Pat Ferrel p...@occamsmachete.com wrote: Along with workspaces and code completion, +1 for visualization and extended (Bayesian, stats, etc.) ops. Anything that is scalable and general seems fair game. Also -1 for incubation. This is all an evolution of loosely collected algos into generalizations and extensions of legacy stuff on new ground. Also +1 for separating out packages more formally, like spark-itemsimilarity and other things that aren’t general. They may come with generalized bits (like similarity) but have package-like delivery mechanisms. We should be able to have something better than contrib, especially since these may come with math and core extensions that are generally useful. No need to separate that until the core is done. However, a new identity would be a big boost to being able to communicate the new mission, and it is a new mission. If the issue is about support for legacy, that doesn’t seem to be a problem. If we stay a top-level project we can support legacy; in fact we have to. On Feb 25, 2015, at 6:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: -1 on incubation as well. The website and docs and user lists and this champion and mentor stuff, and logos and promotions for committers, absolutely do not make any sense at this point. From what I hear, people are pretty busy without having that as it is. It would probably make more sense to take both Andrews :) and the committers who actively pursue the programming-environment vision to the PMC, and for people who feel that they have no valuable input for the new philosophy of the project to just go emeritus and give up their voting rights. Power of do, as they say. There's no major change in philosophy either -- Mahout has been proclaiming scalable machine learning, which is what we will continue doing. Only doing it (hopefully) a bit easier and with a new set of backend tools.
I want to emphasize that I'd seek math environment status in a more general sense: not just algebraic, but also connect this to stats, samplers, optimizers (including bayesian opts), feature extractors, i.e. all the basic big-ML tools. Adapt Spark's DataFrame to these tools where appropriate. Viewing it as solely distributed algebra is a bit skewed away from reality. On private branches, I have previously developed a lot of that functionality (except for the visual stuff) and it is in practice very useful; it creates a common umbrella for people with an R background. I would very much want to integrate something for visualization, as it is important for an environment. Unfortunately, I don't see any mature science plotting for the JVM around. Scatter plots at best. I want at least to be able to plot 2d maps and KDEs with contours or density levels. There are ways to visualize massive datasets (and their parts). I see no tools for this around at all. Maybe some clever way to integrate with ggplot2 or a shiny server? Even that, even if it required 3rd-party software installation, would've been better than nothing at all. I don't expect methodologies to go to contrib, actually. Slightly different modules, maybe, but not so extreme as contrib. On Wed, Feb 25, 2015 at 5:18 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: How much would be involved in changing the name of a top-level project? I'd prefer to avoid the overhead of going back into incubation. I agree 0.10 makes more sense. On Wed, Feb 25, 2015 at 12:16 PM, Sean Owen sro...@gmail.com wrote: My $0.02: There is no shortage of algorithm libraries that are in some way runnable on Hadoop out there, and not as much easy-to-use distributed matrix operation libraries. I think it's more additive to the ecosystem to solve that narrow, and deep, linear algebra problem and really nail it. That's a pretty good 'identity' to claim. It seems like an appropriate scope.
I do think the project has changed so much that it's more confusing to keep calling it Mahout than to change the name. I can't think of one person I've talked to about Mahout in the last 6 months who was not under the impression that what is in 0.9 has simply been ported to Spark. It's different enough that it could even be its own incubator project (under a different name). The brand recognition is for the deprecated part, so keeping that is almost the problem. It's not crazy to just change the name. Or even consider a re-incubation. It might give some latitude to more fully reboot. Releasing 1.0.0 on the other hand means committing to the APIs (and name) for some fairly new code and fairly soon. Given that this is sort of a 0.1 of a new project, going to 1.0 feels semantically wrong. But a release would be good. Personally I'd suggest 0.10. On Wed, Feb 25, 2015 at 5:50 PM, Pat Ferrel p...@occamsmachete.com wrote: Looking back over the last year Mahout has gone through a lot
Re: PMML
PMML is a machine-to-machine mechanism, not intended really for human consumption or production. Based on XML, it is, of course, bloated, but that doesn't really matter for readability since reading isn't the goal. The vision of making models easy to transfer from system to system is nice, but the reality has fallen far short, unfortunately. The problem is that systems often have special aspects that make it hard to replicate exact actions from one system to another. Having a textual format for numerical data doesn't help. Here, for instance, is a linear regression model that I created using R:

<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
  <Header copyright="Copyright (c) 2015 tdunning" description="Linear Regression Model">
    <Extension name="user" value="tdunning" extender="Rattle/PMML"/>
    <Application name="Rattle/PMML" version="1.4"/>
    <Timestamp>2015-03-05 09:46:32</Timestamp>
  </Header>
  <DataDictionary numberOfFields="4">
    <DataField name="y" optype="continuous" dataType="double"/>
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="x3" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="Linear_Regression_Model" functionName="regression" algorithmName="least squares">
    <MiningSchema>
      <MiningField name="y" usageType="predicted"/>
      <MiningField name="x1" usageType="active"/>
      <MiningField name="x2" usageType="active"/>
      <MiningField name="x3" usageType="active"/>
    </MiningSchema>
    <Output>
      <OutputField name="Predicted_y" feature="predictedValue"/>
    </Output>
    <RegressionTable intercept="-0.000669089797102863">
      <NumericPredictor name="x1" exponent="1" coefficient="3.00018785681213"/>
      <NumericPredictor name="x2" exponent="1" coefficient="-1.00362806356329"/>
      <NumericPredictor name="x3" exponent="1" coefficient="0.998224481877296"/>
    </RegressionTable>
  </RegressionModel>
</PMML>

This looks pretty reasonable (if verbose).
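For what the XML above actually encodes, the RegressionTable is just a linear formula. Here is a minimal Python sketch of how a consumer might score it; the coefficient values are taken from the model above, while the helper names are invented for illustration:

```python
# Hypothetical scorer for the RegressionTable above; only the numeric
# values come from the PMML snippet, everything else is illustrative.
intercept = -0.000669089797102863
coefficients = {"x1": 3.00018785681213,
                "x2": -1.00362806356329,
                "x3": 0.998224481877296}

def predict(row):
    """Score one input row: intercept + sum(coefficient * value)."""
    return intercept + sum(c * row[name] for name, c in coefficients.items())

print(round(predict({"x1": 1.0, "x2": 1.0, "x3": 1.0}), 3))  # prints 2.994
```

The model was evidently fit to approximate y = 3*x1 - x2 + x3, which is what the near-integer coefficients suggest.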
It takes 1.5 kB to store a model, but this compresses to around 600 bytes. More involved models are a different story. I built a simple random forest on the same data and simply converting it to PMML took several minutes. Presumably the R package involved is kind of inefficient, but this still is pretty daunting. Manipulating the resulting PMML representation is actually quite difficult. Saving the random forest model ultimately resulted in a 50MB file. Compression reduced that to about 6MB. This is pretty massive for a fairly simple model. On Thu, Mar 5, 2015 at 4:25 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: I think keeping it simple is best; try implementing one or two models in XML and then get fancy if it makes sense. On Wednesday, March 4, 2015, Saikat Kanjilal sxk1...@hotmail.com wrote: Next question: Is the audience for PMML programmers or could it be folks that can script? I'm wondering how this intersects with a simple Spark-like DSL; could Mahout implement an intersection between the two? If there's interest I can go into examples. Sent from my iPhone On Mar 4, 2015, at 4:17 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Sure, those would be options. On Wed, Mar 4, 2015 at 3:41 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Question, is there a way to introduce PMML using a more lightweight format like yaml or json? Date: Wed, 4 Mar 2015 13:25:29 -0800 Subject: Re: PMML From: andrew.mussel...@gmail.com To: dev@mahout.apache.org Yes, the limitations are often an issue for people doing things that aren't in the PMML spec yet; there could be room for suggesting new features in the spec by building them though, I suppose. Also agree that XML is a lousy/bloated way of representing stuff like this, but in the end it's just a choice of representation so there may be reason to use some other encoding and then provide an XML-export function.
On Wed, Mar 4, 2015 at 11:42 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: I am willing to +1 any contribution at this point. My previous company used PMML to serialize simple stuff, but I don't have first-hand experience. Its flexibility is ultimately pretty limited, isn't it? And XML is ultimately a medium which is too ugly and too verbose at the same time to represent models with any more or less decent number of parameters. On Tue, Mar 3, 2015 at 8:19 PM, Suneel Marthi suneel.mar...@gmail.com wrote: It makes sense to support PMML for classification and clustering tasks to be able to share and distribute trained models. Sean, Pat, Dmitriy and Ted please chime in. PMML support in Mahout has been talked about for a long time now but never really got any traction to take off. +1 to
Re: Faster collections for a faster Mahout
What is the license on fastutil? I seem to remember that it was GPL at one time. On Sat, Jan 17, 2015 at 2:34 PM, Sebastiano Vigna vi...@di.unimi.it wrote: Dear developers, I'm writing to suggest significantly improving Mahout's speed by replacing the current, Colt-based collections with faster collections. These are results from benchmarks at java-performance.info comparing fastutil and Mahout in get operations (Mahout collections were not included in the java-performance.info tests):

tests.maptests.primitive.MahoutMapTest (1) = 2176.118213996
tests.maptests.primitive.FastUtilMapTest (1) = 782.852852799
tests.maptests.primitive.MahoutMapTest (10) = 2630.1235654
tests.maptests.primitive.FastUtilMapTest (10) = 1074.903566002
tests.maptests.primitive.MahoutMapTest (100) = 3969.1322968
tests.maptests.primitive.FastUtilMapTest (100) = 1940.7466792

This is with fastutil 6.6.1, which is comparable in speed to Koloboke or the GS collections (the java-performance.info tests use an older, slower version), and, I believe, faster for the purposes of Mahout. Get operations in Mahout collections are 2-3x slower. I locally modified RandomAccessSparseVector to use fastutil and ran some of the VectorBenchmarks:

0 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Create (copy) RandSparseVector mean = 12.57us; mean = 64.88us;
32935 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Create (incrementally) RandSparseVector mean = 31.77us; mean = 79.33us;
244212 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Plus RandSparseVector mean = 47.36us; mean = 101.63us;

On the left you can find the fastutil timings, on the right the Mahout timings.
The only case in which I saw a slowdown is for some dense/sparse products:

429433 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Times Rand.fn(Dense) mean = 78us; mean = 52.47us;

but I think this is due to the different way removals are handled: Mahout uses tombstones (and thus slows down all subsequent operations), whereas fastutil does true deletions, which are slightly slower at remove time, but make subsequent operations faster. Also, iteration over a fastutil-based RandomAccessSparseVector is slowed down by having to return non-standard Element instances instead of Map.Entry instances (as fastutil or the JDK would do naturally). If you'd like to benchmark the speed at a high level, the one-file drop-in is included (you'll need to add fastutil 6.6.1 as a dependency to mahout-math). As I said, things can be improved by using a standard Map.Entry (Int2DoubleMap.Entry) instead of Element. But this is a more pervasive change. Ciao, seba PS: One caveat: presently fastutil does not shrink backing arrays, which might not be what you want. It will, however, from the next release.
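The tombstone point above can be illustrated outside Java. The following is a toy Python open-addressing map, not the actual Mahout or fastutil code, showing how a tombstone left behind by removal keeps lengthening later probe sequences, whereas a true deletion would let lookups stop earlier:

```python
# Toy open-addressing hash map with tombstone removal. An illustration
# of the behavior discussed above, NOT code from Mahout or fastutil.
# Simplified: assumes a key is never re-inserted after being removed.

EMPTY, TOMBSTONE = object(), object()

class TombstoneMap:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while True:
            yield i
            i = (i + 1) % len(self.slots)

    def put(self, key, value):
        for i in self._probe(key):
            s = self.slots[i]
            if s is EMPTY or s is TOMBSTONE or s[0] == key:
                self.slots[i] = (key, value)
                return

    def get_with_probes(self, key):
        """Return (value, number of slots inspected)."""
        probes = 0
        for i in self._probe(key):
            s = self.slots[i]
            probes += 1
            if s is EMPTY:
                return None, probes
            if s is not TOMBSTONE and s[0] == key:
                return s[1], probes

    def remove(self, key):
        for i in self._probe(key):
            s = self.slots[i]
            if s is EMPTY:
                return
            if s is not TOMBSTONE and s[0] == key:
                # mark, don't compact: later lookups still scan this slot
                self.slots[i] = TOMBSTONE
                return

m = TombstoneMap()
for k in (0, 8, 16):      # all three collide into slot 0 (capacity 8)
    m.put(k, str(k))
m.remove(8)
_, probes = m.get_with_probes(16)   # the tombstone is still scanned: 3 probes
```

A true-deletion scheme compacts the probe chain at remove time (slightly more work then), so subsequent gets never wade through dead slots, which matches the trade-off Sebastiano describes.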
Re: Questions about Minhash/SimHash methods
I just looked a little bit and have a few questions. First, these appear to be Java implementations for a single machine. How scalable is that? How would it interact with the new math framework? Second, there are a number of style issues like author tags, indentation and such, but what I find most troubling is an almost complete lack of javadoc and a complete lack of comments about the origin of the algorithms being used, or non-trivial comments about what is happening in the code. I see comments on sections like update w. That doesn't say anything that the code doesn't say. Sent from my iPhone On Jan 10, 2015, at 1:45, Andrew Musselman andrew.mussel...@gmail.com wrote: Non-negative matrix factorization would be a good addition; if you can include tests with your pull request it will help. Assuming this is your PR: https://github.com/apache/mahout/pull/70 Looking forward to more. On Jan 9, 2015, at 11:21 PM, 梁明强 mqliang031...@gmail.com wrote: Dear sir, This is Liang Mingqiang, an undergraduate student, highly interested in Recommender Systems and Mahout. I have implemented the Non-negative Matrix Factorization (NMF) and Probabilistic Matrix Factorization (PMF) methods and opened a pull request with my code for further comment. I tested my code on my computer using the movielens dataset and got reasonable results. Do I need to write and submit a test module for my code? Since I need a dataset for my test, can I add some text files in the test package? In addition, Binary Matrix Factorization (BMF) seems very interesting; I want to contribute my BMF code to Mahout in the next step. Last, but not least, Minhash and SimHash are very popular and useful methods in Recommender Systems. But I looked through the source code of Mahout, and there seem to be no Minhash and SimHash methods. Does that mean those methods haven't been contributed, or is it just that I haven't checked the source code carefully? If those two methods have been contributed, is there anyone willing to tell me the path? Thank you!
Looking forward, Liang Mingqiang
Re: Questions about Minhash/SimHash methods
On Sun, Jan 11, 2015 at 6:51 PM, 梁明强 mqliang031...@gmail.com wrote: In addition, what you mean the new math framework here? Mahout has a new math framework written in scala that parallelizes mathematical operations.
Re: kmeans result is different from scikit-learn result with center points provided
Running this gist can be done using the following two lines of R, btw:

library(devtools)
source_url(
  "https://gist.githubusercontent.com/tdunning/e1575ad2043af732c219/raw/444514454a6f3b5fcbbcaa3f8a919b1965e07f16/Clustering%20is%20hard"
)

You should see something like this as output:

SHA-1 hash of file is 2bc9bf7677d6d5b8b7aa1b1d49749574f5bd942e
$fail
[1] 96

$success
[1] 4

counts
 1  2  3  4
 4 71 22  3

On Mon, Jan 5, 2015 at 11:50 PM, Ted Dunning ted.dunn...@gmail.com wrote: Clustering is harder than you appear to think: http://www.imsc.res.in/~meena/papers/kmeans.pdf https://en.wikipedia.org/wiki/K-means_clustering NP-hard problems are typically solved by approximation. K-means is a great example. Only a few, relatively unrealistic, examples have solutions apparent enough to be found reliably by diverse algorithms. For instance, something as easy as Gaussian clusters with sd=1e-3 situated on 10 random corners of a unit hypercube in 10-dimensional space will be clustered differently by many algorithms unless multiple starts are used. For instance, see https://gist.github.com/tdunning/e1575ad2043af732c219 for an R script that demonstrates that R's standard k-means algorithms fail over 95% of the time for this trivial input, occasionally splitting a single cluster into three parts. Restarting multiple times doesn't fix the problem ... it only makes it a bit more tolerable. This example shows how even 90 restarts could fail for this particular problem. On Mon, Jan 5, 2015 at 11:03 PM, Lee S sle...@gmail.com wrote: But the parameters and distance measure are the same. Only difference: Mahout kmeans convergence is based on whether every cluster has converged; scikit-learn is based on a within-cluster sum of squares criterion. 2015-01-06 14:15 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: I don't think that data is sufficiently clusterable to expect a unique solution. Mean squared error would be a better measure of quality.
On Mon, Jan 5, 2015 at 10:07 PM, Lee S sle...@gmail.com wrote: Data is in this link: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data . I converted it to a sequence file with InputDriver. 2015-01-06 14:04 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: What kind of synthetic data did you use? On Mon, Jan 5, 2015 at 8:29 PM, Lee S sle...@gmail.com wrote: Hi, I used the synthetic data to test the kmeans method, and I wrote the code myself to convert center points to sequence files. Then I ran kmeans with the parameters (-i input -o output -c center -x 3 -cd 1 -cl) and compared the dumped clusteredPoints with the result of the scikit-learn kmeans; it's totally different. I'm very confused. Has anybody ever run kmeans with center points provided and compared the result with another ML library?
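Ted's point generalizes beyond R: Lloyd-style k-means only finds a local optimum, so the result depends entirely on the starting centers. The following self-contained Python sketch makes that concrete with toy 2-d data and hand-picked seeds (both invented for illustration; this is neither Mahout's nor scikit-learn's implementation):

```python
# Minimal Lloyd's algorithm. The data and seed choices below are
# invented for illustration only.

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def lloyd(points, centers, iters=20):
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
            clusters[j].append(p)
        # move each center to the mean of its assigned points
        centers = [mean(c) if c else centers[k] for k, c in enumerate(clusters)]
    return centers

# three tight, well-separated clusters
points = [(0, 0), (0.2, 0), (-0.2, 0),      # cluster A
          (0, 10), (0.2, 10), (-0.2, 10),   # cluster B
          (10, 0), (10.2, 0), (9.8, 0)]     # cluster C

# one seed per cluster: recovers the true centers (0,0), (0,10), (10,0)
good = lloyd(points, [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)])

# two seeds fell inside cluster A: A gets split into two centers, while
# B and C are merged into a single bogus center at (5, 5)
bad = lloyd(points, [(-0.2, 0.0), (0.2, 0.0), (5.0, 5.0)])
```

Both runs converge, and neither reports an error; only the starting centers differ. Multiple random restarts reduce the chance of a bad outcome but, as the R gist shows, do not eliminate it.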
Re: kmeans result is different from scikit-learn result with center points provided
I don't think that data is sufficiently clusterable to expect a unique solution. Mean squared error would be a better measure of quality. On Mon, Jan 5, 2015 at 10:07 PM, Lee S sle...@gmail.com wrote: Data is in this link: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data . I converted it to a sequence file with InputDriver. 2015-01-06 14:04 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: What kind of synthetic data did you use? On Mon, Jan 5, 2015 at 8:29 PM, Lee S sle...@gmail.com wrote: Hi, I used the synthetic data to test the kmeans method, and I wrote the code myself to convert center points to sequence files. Then I ran kmeans with the parameters (-i input -o output -c center -x 3 -cd 1 -cl) and compared the dumped clusteredPoints with the result of the scikit-learn kmeans; it's totally different. I'm very confused. Has anybody ever run kmeans with center points provided and compared the result with another ML library?
Re: kmeans result is different from scikit-learn result with center points provided
Clustering is harder than you appear to think: http://www.imsc.res.in/~meena/papers/kmeans.pdf https://en.wikipedia.org/wiki/K-means_clustering NP-hard problems are typically solved by approximation. K-means is a great example. Only a few, relatively unrealistic, examples have solutions apparent enough to be found reliably by diverse algorithms. For instance, something as easy as Gaussian clusters with sd=1e-3 situated on 10 random corners of a unit hypercube in 10-dimensional space will be clustered differently by many algorithms unless multiple starts are used. For instance, see https://gist.github.com/tdunning/e1575ad2043af732c219 for an R script that demonstrates that R's standard k-means algorithms fail over 95% of the time for this trivial input, occasionally splitting a single cluster into three parts. Restarting multiple times doesn't fix the problem ... it only makes it a bit more tolerable. This example shows how even 90 restarts could fail for this particular problem. On Mon, Jan 5, 2015 at 11:03 PM, Lee S sle...@gmail.com wrote: But the parameters and distance measure are the same. Only difference: Mahout kmeans convergence is based on whether every cluster has converged; scikit-learn is based on a within-cluster sum of squares criterion. 2015-01-06 14:15 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: I don't think that data is sufficiently clusterable to expect a unique solution. Mean squared error would be a better measure of quality. On Mon, Jan 5, 2015 at 10:07 PM, Lee S sle...@gmail.com wrote: Data is in this link: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data . I converted it to a sequence file with InputDriver. 2015-01-06 14:04 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: What kind of synthetic data did you use? On Mon, Jan 5, 2015 at 8:29 PM, Lee S sle...@gmail.com wrote: Hi, I used the synthetic data to test the kmeans method, and I wrote the code myself to convert center points to sequence files.
Then I ran kmeans with the parameters (-i input -o output -c center -x 3 -cd 1 -cl) and compared the dumped clusteredPoints with the result of the scikit-learn kmeans; it's totally different. I'm very confused. Has anybody ever run kmeans with center points provided and compared the result with another ML library?
Re: kmeans result is different from scikit-learn result with center points provided
What kind of synthetic data did you use? On Mon, Jan 5, 2015 at 8:29 PM, Lee S sle...@gmail.com wrote: Hi, I used the synthetic data to test the kmeans method, and I wrote the code myself to convert center points to sequence files. Then I ran kmeans with the parameters (-i input -o output -c center -x 3 -cd 1 -cl) and compared the dumped clusteredPoints with the result of the scikit-learn kmeans; it's totally different. I'm very confused. Has anybody ever run kmeans with center points provided and compared the result with another ML library?
[jira] [Assigned] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient
[ https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reassigned MAHOUT-1636: --- Assignee: Ted Dunning Class dependencies for the spark module are put in a job.jar, which is very inefficient --- Key: MAHOUT-1636 URL: https://issues.apache.org/jira/browse/MAHOUT-1636 Project: Mahout Issue Type: Bug Components: spark Affects Versions: 1.0-snapshot Reporter: Pat Ferrel Assignee: Ted Dunning Fix For: 1.0-snapshot Using a maven plugin and an assembly job.xml, a job.jar is created with all dependencies, including transitive ones. This job.jar is in mahout/spark/target and is included in the classpath when a Spark job is run. This allows dependency classes to be found at runtime, but the job.jar includes a great deal of things that are not needed and are duplicates of classes found in the main mrlegacy job.jar. If the job.jar is removed, drivers will not find needed classes. A better way needs to be implemented for including class dependencies. I'm not sure what that better way is, so I am leaving the assembly alone for now. Whoever picks up this Jira will have to remove it after deciding on a better method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient
[ https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258393#comment-14258393 ] Ted Dunning commented on MAHOUT-1636: - The MIT license is one of the most liberal licenses around and is completely compatible with Apache as a dependency. You can find more information including a list of the so-called category A (totally OK) licenses and the category X (no way, no how) licenses here: http://www.apache.org/legal/resolved.html#category-a
Re: The next time someone wants to help
Hadoop dependencies are a quagmire. It would be far preferable to rewrite the necessary serialization to avoid Hadoop dependencies entirely. If we are dropping the MR code, why do we need to reference the VectorWritable class at all? Even in the worst case, we could simply recode the binary layer from scratch without the heinous dependencies. On Fri, Dec 12, 2014 at 10:06 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: A bit more detail on what needs to happen here IMO: Likely, hadoop-related things we still need for spark etc., like VectorWritable, need to be factored out into a (new) module, mahout-hadoop or something. An important notion here is that we only want to depend on hadoop-commons, which in theory should be common for both the new and old hadoop MR APIs. We may face the fact that we need hdfs as well there, e.g. perhaps for reading sequence file headers, not sure; but we definitely do not need anything mapreduce. Math still cannot depend on that mahout-hadoop, since math must not depend on anything hadoop; that was the premise since like the beginning. Mahout-math is in-core ops only, a lightweight, self-contained thing. More likely, the spark module (and maybe some others if they use that) will have to depend on hadoop serialization for vectors and matrices directly, i.e. on mahout-hadoop. The mrlegacy stuff of course needs to be completely isolated (nobody else depends on it) and made dependent on mahout-hadoop as well. On Fri, Dec 12, 2014 at 9:38 AM, Pat Ferrel p...@occamsmachete.com wrote: The next time someone wants to get into contributing to Mahout, wouldn’t it be nice to prune dependencies? For instance Spark depends on math-scala, which depends on math—at least ideally, but in reality the dependencies include mr-legacy. If some things were refactored into math we might have a much streamlined dependency tree. Some things in Math also can be replaced with newer Scala libs and so could be moved out to a java-common or something that would not be required by the Scala code.
If people are going to use the V1 version of Mahout it would be nice if the choice didn’t force them to drag along all the legacy code if it isn’t being used.
Re: I would like to contribute to the Mahout library
On Thu, Nov 27, 2014 at 6:11 AM, Ray rtmel...@gmail.com wrote: 1) Sign up to maintain the fpgrowth code, with the thought of adding some alternative to the Hadoop MapReduce portion of the implementation. 2) Is there still interest in a deep autoencoder for time series? Both of these are of interest, the first particularly so since several people have asked about this lately. Having a non-map-reduce version of fp-growth would make it possible to maintain that code going forward.
Re: elementwise operator improvements experiments
Isn't it true that sparse iteration should always be used for m := f iff 1) the matrix argument is sparse AND 2) f(0) == 0? Why the need for syntactic notation at all? This property is much easier to test than commutativity. On Sun, Nov 16, 2014 at 7:42 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Another thing is that the optimizer isn't capable of figuring out all elementwise fusions in an elementwise expression; e.g. it is not seeing commutativity rewrites, such as that A * B * A should optimally be computed as sqr(A) * B (it will do it as two pairwise operators, (A*B)*A). Bummer. To do it truly right, it needs to fuse entire elementwise expressions first and then optimize them separately. Ok, that's probably too much for now. I am quite ok with writing something like -0.5 * (a * a) for now. On Sat, Nov 15, 2014 at 10:14 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: PS actually applying an exponent function in place will require an additional underscore, it looks. It doesn't want to treat a function name as a function type in this context for some reason (although it does not require partial syntax when used in arguments inside parentheses): m := exp _ Scala is quirky this way I guess. On Sat, Nov 15, 2014 at 10:02 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: So I did quick experimentation with elementwise operator improvements: (1) stuff like 1 + exp(M): (1a): this requires generalization in the optimizer for elementwise unary operators. I've added things like a notion of whether operators require non-zero iteration only or not. (1b): added fusion of elementwise operators, i.e. ew(1+, ew(exp, A)) is rewritten as ew(1+exp, A) for performance reasons. It still uses an application of a fold over a functional monoid, but I think it should be a fairly ok performance/DSL trade-off here. To get it even better, we may add functional assignment syntax to distributed operands similar to in-memory types, as described further down.
(1c): a notion that self-elementwise things such as expr1 * expr1 (which is a surprisingly frequent occurrence, e.g. in Torgerson MDS) are rewritten as ew(A, square) etc. So that much works. (Note that this also obsoletes the dedicated scalar/matrix elementwise operators that there currently are.) Good. The problem here is that (of course!) the semantics of the Scala language has problems importing something like exp(Double):Double alongside exp(DRM):DRM, apparently because it doesn't adhere to overloading rules (different results), so in practice even though it is allowed, one import overshadows the other. Which means, for the sake of the DSL, we can't have exp(matrix); we have to name it something else. Unless you see a better solution. So ... elementwise naming options: Matrix: mexp(m), msqrt(m), msignum(m). Vector: vexp(v), vsqrt(v)... DRM: dexp(drm), dsqrt(drm)? Let me know what you think. (2) Another problem is that actually doing something like 1+exp(m) on Matrix or Vector types is pretty impractical since, unlike in R (which can count the number of variables bound to an object), the semantics require creating a clone of m for something like exp(m) to guarantee no side effects on m itself. That is, the expression 1 + exp(m) for Matrix or Vector types causes 2 clone-copies of the original argument. Actually that's why I use in-place syntax for in-memory types quite often, something like 1 +=: (x *= x) instead of the more naturally looking 1 + x * x. But unlike with simple elementwise operators (+=), there's no in-place modification syntax for a function. We could add an additional parameter, something like mexp(m, inPlace=true), but I don't like it too much. What I like much more is functional assignment (we already have assignment to a function (row, col, x) => Double, but we can add elementwise function assignment) so that it really looks like m := exp That is pretty cool. Except there's a problem of optimality of assignment. There are functions here (e.g.
abs, sqrt) that don't require full iteration but rather non-zero iteration only. By default, the notation m := func implies dense iteration. So what I suggest here is to add new syntax to do sparse-iteration functional assignments: m ::= abs I actually like it (a lot) because it is short and because it allows for more complex formulas in the same traversal, e.g. the proverbial R exp(m)+1 in place will look like m := (1 + exp(_)) So not terrible. What it lacks though is automatic determination of whether a composite function needs to apply to all elements vs. non-zeros only for in-memory types (for distributed types the optimizer tracks this automatically). I.e. m := abs is not optimal (because abs doesn't affect 0s) and m ::= (abs(_) + 1) is probably also not what one wants (when we have a composition of dense- and sparse-affecting functions, the result is dense).
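Ted's f(0) == 0 test from earlier in the thread can be sketched outside the Scala DSL. The following is an illustrative Python model of a sparse vector, not Mahout code: the assign operator picks sparse or dense iteration automatically by probing the function at zero, which is exactly the property that would make the m ::= syntax unnecessary:

```python
import math

# Illustrative model of the rule discussed above: when assigning m := f
# elementwise, iterate only the stored non-zero entries iff f(0) == 0;
# otherwise every element (including implicit zeros) must be visited.
# This is NOT Mahout's RandomAccessSparseVector.

class SparseVector:
    def __init__(self, size, entries):
        self.size = size
        self.entries = dict(entries)  # index -> non-zero value

    def assign(self, f):
        if f(0.0) == 0.0:
            # sparse iteration: zeros stay zero, touch stored entries only
            self.entries = {i: f(x) for i, x in self.entries.items()}
        else:
            # dense iteration: implicit zeros become f(0) != 0
            self.entries = {i: f(self.entries.get(i, 0.0))
                            for i in range(self.size)}
        return self

v = SparseVector(1_000_000, {3: 4.0, 10: -2.0})
v.assign(abs)       # abs(0) == 0: visits 2 entries, not a million
w = SparseVector(4, {1: 1.0}).assign(math.exp)   # exp(0) == 1: dense
```

The probe only classifies a single composite function, so it also handles the composed case correctly: assigning lambda x: abs(x) + 1 probes to 1 at zero and falls into the dense branch, which matches the "composition of dense- and sparse-affecting functions is dense" observation above.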
Re: SGD Implementation and Questions for mapBlock like functionality
On Wed, Nov 12, 2014 at 9:53 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Once we start mapping aggregate, there's no reason not to map other engine-specific capabilities, which are vast. At this point the dilemma is, no matter what we do we are losing coherency: if we map it all, then other engines will have trouble supporting all of it. If we don't map it all, then we are forcing a capability reduction compared to what the engine actually can do. It is obvious to me that an all-reduce aggregate will make a lot of sense -- even if it means a math checkpoint. But then where do we stop in mapping those? E.g. do we do fold? cartesian? And what is the true reason we are remapping everything if it is already natively available? etc. etc. For myself, I still haven't figured out a good answer to those. Actually, I disagree with the premise here. There *is* a reason not to map all other engine-specific capabilities. That reason is we don't need them. Yet. So far, we *clearly* need some sort of block aggregate for a host of hog-wild sorts of algorithms. That doesn't imply that we need all kinds of mapping aggregates. It just means that we are clear on one need for now. So let's get this one in and see how far we can go. Also, having one kind of aggregation in the DSL does not restrict anyone from using engine-specific capabilities. It just means that one kind of idiom can be done without engine specificity.
Re: SGD Implementation and Questions for mapBlock like functionality
On Wed, Nov 12, 2014 at 2:08 PM, Gokhan Capan gkhn...@gmail.com wrote: Can we easily integrate t-digest for descriptives once we have block aggregates? This might count one more reason. Presumably. T-digest is already in Mahout as part of the OnlineSummarizer.
Re: Mahout 1.0 features (revisited)
On Thu, Oct 23, 2014 at 3:57 PM, Andrew Palumbo ap@outlook.com wrote: Or I can just commit as is and people can have at the organization. Sounds good to me!
Re: Upgrade to Spark 1.1.0?
On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel p...@occamsmachete.com wrote: The problem is not in building Spark; it is in building Mahout using the correct Spark jars. If you are using CDH and Hadoop 2, the correct jars are in the repos. This should be true for MapR as well.
Re: Upgrade to Spark 1.1.0?
On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel p...@occamsmachete.com wrote: Getting off the dubious Spark 1.0.1 version is turning out to be a bit of work. Does anyone object to upgrading our Spark dependency? I’m not sure if Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean upgrading your Spark cluster. It is going to have to happen sooner or later. Sooner may actually be less total pain.
Re: How to build a recommendation system based on mahout serving millions even billions of users ?
You should move forward to version 0.9. Take a look at more recent methods in this book: https://www.mapr.com/practical-machine-learning On Tue, Oct 14, 2014 at 2:43 AM, 王建国 jordanhao...@gmail.com wrote: Hi, Owen and all: I am a developer from China. I am building a recommendation system based on Mahout version 0.9. Since the user ids and item ids are strings, I need to map them to longs. But I found that there is a long-to-int mapping provided by the function int TasteHadoopUtils.idToIndex(long). Considering there may be millions or even billions of users, I wonder if it is possible to have many longs mapped onto one int? If true, that does do much harm. This is quite confusing. What solution should I choose in this situation? Meanwhile, I read the answer from you as follows. Could you please tell me which data structure indexed by long you use in Myrrix. Thanks in advance. wangjiangwei Question: I have read some code about item-based recommendation in version 0.6, starting from org.apache.mahout.cf.taste.hadoop.item.RecommenderJob. I found that there is a long-to-int mapping provided by the function int TasteHadoopUtils.idToIndex(long). The long-to-int mapping is performed both on userId and itemId. I wonder if it is possible to have two longs mapped onto one int? If that is the case, then we would be likely to merge vectors from different item ids/user ids, right? This is quite confusing. Is it better to provide a RandomAccessSparseVector implemented by OpenLongDoubleHashMap instead of OpenIntDoubleHashMap? Thanks in advance. Wei Feng Answer: That's right. It ought to be uncommon but can happen. For recommenders, it only means that you start to treat two users or two items as the same thing. That doesn't do much harm though. Maybe one user's recs are a little funny. I do think it would have been useful to index by long, but that would have significantly increased memory requirements too.
(In developing Myrrix I have switched to use a data structure indexed by long though, because it becomes more necessary to avoid the mapping.) Sean Owen
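A Python sketch of the pigeonhole problem Sean describes. The folding function below is illustrative only (the real TasteHadoopUtils.idToIndex uses its own hash): any mapping from 2^64 longs down to 2^31 non-negative ints must send some distinct ids to the same index.

```python
def id_to_index(long_id):
    # Illustrative fold of a 64-bit id into a non-negative 31-bit index.
    # (TasteHadoopUtils.idToIndex uses its own hash function; the
    # pigeonhole argument holds for ANY long -> int mapping.)
    return (long_id ^ (long_id >> 32)) & 0x7FFFFFFF

# 2^64 possible ids but only 2^31 indices, so collisions must exist:
a = 5
b = 5 + (1 << 31)
assert a != b
assert id_to_index(a) == id_to_index(b)  # two distinct ids, one index
```

In a recommender, such a collision merges two users (or items) into one, which is mostly harmless at small scale but becomes more likely as the id space fills up toward billions of users.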
Re: The portability of MAHOUT platform to python
It is plausible to port some of the newer Scala stuff to Python. It would take some thought about the right way to do it. The kicker is going to be that a lot of what Mahout does bottoms out in math that is written in Java. How that would work from Python is mysterious to me. On Mon, Oct 13, 2014 at 9:18 PM, Vibhanshu Prasad vibhanshugs...@gmail.com wrote: Hello Everyone, I am a college student who wants to contribute towards the development of the Mahout library. I have been using it for the last year and was mesmerized by its features. I wanted to know if someone is working towards porting this whole platform to Python. If not, is there any possible way I can start doing it, provided that I am not a committer yet. Regards Vibhanshu
Re: https://mahout.apache.org/developers/buildingmahout.html
I believe that the POM treats the particular versions listed as Hadoop 2 and all others as Hadoop 1. Inspection of the top-level POM would provide the most authoritative answer. On Wed, Oct 1, 2014 at 7:08 AM, jay vyas jayunit100.apa...@gmail.com wrote: hi mahout: Can we use any Hadoop version to build Mahout, i.e. 2.4.1? It seems that if you give it a garbage Hadoop version, i.e. (2.3.4.5), it still builds, yet at runtime it is clear that the version built is a 1.x version. thanks! FYI this is in relation to BIGTOP-1470, where we are just getting ready for our 0.8 release, so any feedback would be much appreciated! -- jay vyas
Re: Interested in developing for mahout
Thejas, A good starter task would be to gather the discussions about the new recommendation system in Scala and write up a tutorial for using it. Writing new bindings in the math section requires a bit of advanced knowledge of Scala and an ability to read some subtle code. Probably not the best starting point. On Mon, Sep 29, 2014 at 11:34 AM, thejas prasad thejch...@gmail.com wrote: Hey Ted, It seemed interesting. I was looking at Jira and also Git, and it seemed as though some Scala bindings were already implemented. Am I correct? I wanted to take up a task that is trivial since I am new to Scala and even Mahout. With that said, I would be interested in writing more Matlab bindings. Does that sound okay? -Thejas On Sun, Sep 28, 2014 at 3:15 PM, Aamir Khan 9aamirk...@gmail.com wrote: Hi, I am also new to Apache and Mahout. This thread caught my attention. Can you tell what are the areas where development is required? Is there any work on *Clustering*? Any guidance on how to start and useful links are highly appreciated. Many thanks, On Mon, Sep 29, 2014 at 1:19 AM, Ted Dunning ted.dunn...@gmail.com wrote: Thejas, What were your impressions? Which parts of the system match your background and capabilities? On Sun, Sep 28, 2014 at 11:46 AM, Thejas Prasad thejch...@gmail.com wrote: Hey Suneel, I finished reading the paper. What's next? Sent from my iPhone On Sep 26, 2014, at 7:04 PM, Suneel Marthi smar...@apache.org wrote: See this for a start http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf On Fri, Sep 26, 2014 at 8:02 PM, thejas prasad thejch...@gmail.com wrote: what exactly in the scala math library? On Fri, Sep 26, 2014 at 1:00 AM, Ted Dunning ted.dunn...@gmail.com wrote: Got it! Sorry to be dense. On Thu, Sep 25, 2014 at 4:23 PM, Thejas Prasad thejch...@gmail.com wrote: Sorry I meant to say what is the best way to get started**?
Thanks, Thejas Sent from my iPhone On Sep 25, 2014, at 4:28 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Sep 25, 2014 at 9:35 AM, Thejas Prasad thejch...@gmail.com wrote: what is the best way to get statues Hmmm I am totally confused. You must have meant something here. Regarding your next question, the place to start work is on the scala math library.
Re: Interested in developing for mahout
Aamir, There would be a substantial interest in clustering, especially the adaptation of our existing streaming k-means and standard k-means to the new math system in Scala. Part of doing that would require some extension of the framework to include a reduce operation. On Sun, Sep 28, 2014 at 1:15 PM, Aamir Khan 9aamirk...@gmail.com wrote: Hi, I am also new to Apache and Mahout. This thread caught my attention. Can you tell what are the areas where development is required. Is there any work on *Clustering*? Any guidance on how to start and useful links are highly appreciated. Many thanks, On Mon, Sep 29, 2014 at 1:19 AM, Ted Dunning ted.dunn...@gmail.com wrote: Thejas, What were your impressions? Which parts of the system match your background and capabilities? On Sun, Sep 28, 2014 at 11:46 AM, Thejas Prasad thejch...@gmail.com wrote: Hey suneel, I finished reading the paper. What's next? Sent from my iPhone On Sep 26, 2014, at 7:04 PM, Suneel Marthi smar...@apache.org wrote: See this for a start http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf On Fri, Sep 26, 2014 at 8:02 PM, thejas prasad thejch...@gmail.com wrote: what exactly in the scala math library? On Fri, Sep 26, 2014 at 1:00 AM, Ted Dunning ted.dunn...@gmail.com wrote: Got it! Sorry to be dense. On Thu, Sep 25, 2014 at 4:23 PM, Thejas Prasad thejch...@gmail.com wrote: Sorry I meant to say what is the best way to get started**? Thanks, Thejas Sent from my iPhone On Sep 25, 2014, at 4:28 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Sep 25, 2014 at 9:35 AM, Thejas Prasad thejch...@gmail.com wrote: what is the best way to get statues Hmmm I am totally confused. You must have meant something here. Regarding your next question, the place to start work is on the scala math library.
Re: Interested in developing for mahout
Thejas, What were your impressions? Which parts of the system match your background and capabilities? On Sun, Sep 28, 2014 at 11:46 AM, Thejas Prasad thejch...@gmail.com wrote: Hey suneel, I finished reading the paper. What's next? Sent from my iPhone On Sep 26, 2014, at 7:04 PM, Suneel Marthi smar...@apache.org wrote: See this for a start http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf On Fri, Sep 26, 2014 at 8:02 PM, thejas prasad thejch...@gmail.com wrote: what exactly in the scala math library? On Fri, Sep 26, 2014 at 1:00 AM, Ted Dunning ted.dunn...@gmail.com wrote: Got it! Sorry to be dense. On Thu, Sep 25, 2014 at 4:23 PM, Thejas Prasad thejch...@gmail.com wrote: Sorry I meant to say what is the best way to get started**? Thanks, Thejas Sent from my iPhone On Sep 25, 2014, at 4:28 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Sep 25, 2014 at 9:35 AM, Thejas Prasad thejch...@gmail.com wrote: what is the best way to get statues Hmmm I am totally confused. You must have meant something here. Regarding your next question, the place to start work is on the scala math library.
Re: Interested in developing for mahout
Got it! Sorry to be dense. On Thu, Sep 25, 2014 at 4:23 PM, Thejas Prasad thejch...@gmail.com wrote: Sorry I meant to say what is the best way to get started**? Thanks, Thejas Sent from my iPhone On Sep 25, 2014, at 4:28 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Sep 25, 2014 at 9:35 AM, Thejas Prasad thejch...@gmail.com wrote: what is the best way to get statues Hmmm I am totally confused. You must have meant something here. Regarding your next question, the place to start work is on the scala math library.
Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes
On Wed, Sep 24, 2014 at 11:09 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Aggregate is Colt's thing. Colt (aka Mahout-math) establishes a Java-side concept of different function types which are unfortunately not compatible with Scala literals. Dmitriy, Is this because we have other methods that describe the characteristics of the function? What would be the Scala-friendly idiom? Additional traits?
Re: Interested in developing for mahout
On Thu, Sep 25, 2014 at 9:35 AM, Thejas Prasad thejch...@gmail.com wrote: what is the best way to get statues Hmmm I am totally confused. You must have meant something here. Regarding your next question, the place to start work is on the scala math library.
Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes
Yes. That code is computing the Frobenius norm. I can't answer the context question about Scala calling Java, however. On Wed, Sep 24, 2014 at 9:15 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Shannon/Dmitriy, quick question: I want to calculate the Scala equivalent of the Frobenius norm per this API spec in Python ( http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html). I dug into the mahout-math-scala project and found the following API to calculate the norm: def norm = sqrt(m.aggregate(Functions.PLUS, Functions.SQUARE)) I believe the above is also calculating the Frobenius norm; however, I am curious why we are calling a Java API from Scala. The type of m above is a Java interface called Matrix. I'm guessing the implementation of aggregate is happening in mahout-math-scala somewhere; is that assumption correct? Thanks in advance. From: sxk1...@hotmail.com To: dev@mahout.apache.org Subject: RE: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes Date: Thu, 18 Sep 2014 12:51:36 -0700 Ok great, I'll use the cartesian Spark API call. So I'd still like some thoughts on where the code that calls the cartesian should live in our directory structure. Date: Thu, 18 Sep 2014 15:33:59 -0400 From: squ...@gatech.edu To: dev@mahout.apache.org Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes Saikat, Spark has the cartesian() method that will align all pairs of points; that's the nontrivial part of determining an RBF kernel. After that it's a simple matter of performing the equation that's given on the scikit-learn doc page. However, like you said, it'll also have to be implemented using the Mahout DSL. I can envision that users would like to compute pairwise metrics for a lot more than just RBF kernels (pairwise Euclidean distance, etc), so my guess would be that a DSL implementation of cartesian() is what you're looking for. You can build the other methods on top of that. Correct me if I'm wrong.
Shannon On 9/18/14, 3:28 PM, Saikat Kanjilal wrote: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html I need to implement the above in the Scala world and expose a DSL API to call the computation when computing the affinity matrix. From: ted.dunn...@gmail.com Date: Thu, 18 Sep 2014 10:04:34 -0700 Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes To: dev@mahout.apache.org There are a number of non-traditional linear algebra operations like this that are important to implement. Can you describe what you intend to do so that we can discuss the shape of the API and computation? On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Dmitriy et al, as part of the above JIRA I need to calculate the Gaussian kernel between 2 shapes. I looked through mahout-math-scala and didn't see anything to do this; any objections to me adding some code under scalabindings to do this? Thanks in advance.
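Two illustrative Python sketches of the math in this thread (assuming plain lists of lists as matrices): the quoted one-liner sqrt(m.aggregate(Functions.PLUS, Functions.SQUARE)) squares every element, sums, and takes the square root, i.e. the Frobenius norm; and the pairwise RBF kernel is the cartesian-product computation Shannon describes.

```python
from math import exp, sqrt

def frobenius_norm(m):
    # Same computation as sqrt(m.aggregate(Functions.PLUS, Functions.SQUARE)):
    # square every element, sum, take the square root.
    return sqrt(sum(x * x for row in m for x in row))

def rbf_kernel(X, Y, gamma):
    # Pairwise K[i][j] = exp(-gamma * ||x_i - y_j||^2) over the
    # cartesian product of rows, as in sklearn's rbf_kernel.
    return [[exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
             for y in Y] for x in X]

m = [[1.0, 2.0], [2.0, 4.0]]
assert frobenius_norm(m) == 5.0  # sqrt(1 + 4 + 4 + 16)

X = [[0.0, 0.0], [1.0, 0.0]]
K = rbf_kernel(X, X, gamma=0.5)
assert K[0][0] == 1.0                    # zero distance
assert abs(K[0][1] - exp(-0.5)) < 1e-12  # squared distance is 1
```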
Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes
There are a number of non-traditional linear algebra operations like this that are important to implement. Can you describe what you intend to do so that we can discuss the shape of the API and computation? On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Dmitriy et al, as part of the above JIRA I need to calculate the Gaussian kernel between 2 shapes. I looked through mahout-math-scala and didn't see anything to do this; any objections to me adding some code under scalabindings to do this? Thanks in advance.
Re: rowsimilarity
LLR with text is commonly done (that is where it comes from). The simple approach would be to have sentences be users and words be items. This will result in word-word connections. This doesn't directly give document-document similarities. That could be done by transposing the original data (word is user, document is item) but I don't quite understand how to interpret that. Another approach is simply using term weighting and document normalization and scoring every doc against every other. That comes down to a matrix multiplication which is very similar to the transposed LLR problem, so that may give an interpretation. On Mon, Aug 25, 2014 at 10:15 AM, Pat Ferrel p...@occamsmachete.com wrote: LLR with text or non-interaction data. What do we use for counts? Do we care how many times a token is seen in a doc, for instance, or do we just look to see if it was seen? I assume the latter, which means we need a new numNonZeroElementsPerRow in several places in math-scala, right? All the same questions are going to come up over this as did for numNonZeroElementsPerColumn, so please speak now or I’ll just mirror its implementation. On Aug 25, 2014, at 9:38 AM, Pat Ferrel pat.fer...@gmail.com wrote: Turning itemsimilarity into rowsimilarity is fairly simple but requires altering CooccurrenceAnalysis.cooccurrence to swap the transposes and calculate the LLR values for rows rather than columns. The input will be something like a DRM. Row similarity does something like AA’ with LLR weighting and uses similar downsampling, as I take it from the Hadoop code. Let me know if I’m on the wrong track here.
With the new application ID preserving code the following input could be directly processed (it’s my test case) doc1\tNow is the time for all good people to come to aid of their party doc2\tNow is the time for all good people to come to aid of their country doc3\tNow is the time for all good people to come to aid of their hood doc4\tNow is the time for all good people to come to aid of their friends doc5\tNow is the time for all good people to come to aid of their looser brother doc6\tThe quick brown fox jumped over the lazy dog doc7\tThe quick brown fox jumped over the lazy boy doc8\tThe quick brown fox jumped over the lazy cat doc9\tThe quick brown fox jumped over the lazy wolverine doc10\tThe quick brown fox jumped over the lazy cantelope The output will be something like the following, with or without LLR strengths. doc1\tdoc2 doc3 doc4 doc5 … doc6\tdoc7 doc8 doc9 doc10 ... It would be pretty easy to tack on a text analyzer from Lucene to turn this into a full-function doc similarity job since LLR doesn’t need TF-IDF. One question is: is there any reason to do the cross-similarity in RSJ, so [AB’]? I can’t picture what this would mean so am assuming the answer is no.
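For reference, a Python sketch of the log-likelihood ratio score the similarity computation weights by. This mirrors, from memory, the shape of Mahout's LogLikelihood.logLikelihoodRatio (entropies of a 2x2 contingency table), so treat the exact form as an assumption rather than the canonical implementation.

```python
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy over raw counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # 2x2 table: k11 = both events, k12/k21 = one only, k22 = neither.
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))

# Independent counts -> no association, LLR ~ 0:
assert abs(llr(10, 10, 10, 10)) < 1e-9
# Strong cooccurrence -> large positive score:
assert llr(100, 1, 1, 100) > 100.0
```

Note that the counts are just presence counts, which is why a numNonZeroElementsPerRow primitive (rather than row sums) is the right building block here.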
Re: [jira] [Commented] (MAHOUT-1610) Tests can be made more robust to pass in Java 8
On Thu, Aug 28, 2014 at 6:04 AM, ASF GitHub Bot (JIRA) j...@apache.org wrote: Github user srowen commented on the pull request: https://github.com/apache/mahout/pull/46#issuecomment-53716190 I may still have the commit bit for ASF git, but can't merge the pull request myself. (I also realize I'm not yet sure if there's another step? Will asfbot merge back to ASF git if merged here?) If you do the commit with the GitHub note "closes #xx", then GitHub does the right thing. Your commit does the merge.
[jira] [Commented] (MAHOUT-1610) Tests can be made more robust to pass in Java 8
[ https://issues.apache.org/jira/browse/MAHOUT-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112304#comment-14112304 ] Ted Dunning commented on MAHOUT-1610: - Looks good to me. Tests can be made more robust to pass in Java 8 --- Key: MAHOUT-1610 URL: https://issues.apache.org/jira/browse/MAHOUT-1610 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.9 Environment: Java 1.8.0_11 OS X 10.9.4 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Labels: java8, tests Right now, several tests don't seem to pass when run with Java 8 (at least on Java 8). The failures are benign, and just due to tests looking for too-specific values or expecting things like a certain ordering of hashmaps. The tests can easily be made to pass both Java 8 and Java 6/7 at the same time by either relaxing the tests in a principled way, or accepting either output of two equally valid ones as correct. (There's also one curious compilation failure in Java 8, related to generics. It is fixable by changing to a more explicit declaration that should be equivalent. It should be entirely equivalent at compile time, and of course, at run time. I am not sure it's not just a javac bug, but, might as well work around when it's so easy.) -- This message was sent by Atlassian JIRA (v6.2#6252)
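A Python sketch (the actual fixes in the PR are in Java) of the two relaxations the issue describes: comparing contents rather than hash-iteration order, and accepting either of two equally valid outputs instead of pinning one.

```python
# Brittle: asserting on hash-iteration order, which changed in JDK 8.
result = {"b": 2, "a": 1}
# assert list(result.keys()) == ["b", "a"]   # may pass or fail by runtime

# Robust: compare contents, not iteration order.
assert set(result.keys()) == {"a", "b"}
assert sorted(result.items()) == [("a", 1), ("b", 2)]

# Robust: when two outputs are equally valid (e.g. an arbitrary
# tie-break between equal scores), accept either instead of pinning one.
winner = min([("x", 1.0), ("y", 1.0)], key=lambda p: p[1])[0]
assert winner in ("x", "y")
```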
Re: Features by engine page
On Mon, Aug 25, 2014 at 2:40 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: This work is obviously also interesting in that it establishes a probabilistic framework in Mahout (distributions, Gaussian process). We already have that (distributions, not GP). Note that we also have an implementation of recorded-step evolutionary programming that works really well for hyper-parameter search. I don't like the way that the API turned out (too hard to understand).