Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair
On 19/07/18 10:24, Sebastian Schelter wrote: Congrats! +1 Looking forward to hearing Andrew's voice on one of the upcoming board calls - please do feel invited to join as a new PMC chair. Isabel
Re: New logo
The green logo was the very first design iteration before, iirc, Robin came up with the yellow one. There should be like five T-shirts worldwide with the old logo, printed in 2009. On May 1, 2017 at 20:41:43 CEST, Trevor Grant wrote: >Thanks Scott, > >You are correct- in fact we're going even further now, that you can do >native optimization regardless of the architecture with native-solvers. > >Do you or anyone more familiar with the history of the website know >anything about the origins/uses of this: >https://mahout.apache.org/images/Mahout-logo-245x300.png >It seems to be a green mahout logo. > >Also Scott, or anyone lurking who may be able to help. As part of the >website reboot I've included a "history" page and would really >apppreciate >some help capturing that from first person sources if possible. Ive put >in >some headers but those are only directional: > >https://github.com/rawkintrevo/mahout/blob/website/website/front/community/history.md > > > >Trevor Grant >Data Scientist >https://github.com/rawkintrevo >http://stackexchange.com/users/3002022/rawkintrevo >http://trevorgrant.org > >*"Fortunate is he, who is able to know the causes of things." -Virgil* > > >On Mon, May 1, 2017 at 11:18 AM, scott cote >wrote: > >> Trevor et al: >> >> Some ideas to spur you on (and related points): >> >> Mahout is no longer a grab bag of algorithms and routines, but a math >> language right? You don’t care about the under the cover >implementation. >> Today its Spark with alternative implementations in Flink, etc …. >> >> Don’t know if that is the long term goal still - haven’t kept up - >but it >> seems like you are insulating yourself from the underlying >technology. >> >> Math is a universal language. Right? >> >> Tower of Babel is coming to mind …. >> >> SCott >> >> > On Apr 27, 2017, at 10:27 PM, Trevor Grant > >> wrote: >> > >> > It also bugs me when I can't suggest any alternatives, yet don't >like the >> > ones in front of me... 
>> > >> > I became aware of a symbol a week or so ago, and it keeps coming >back to >> > me. >> > >> > The Enso. >> > https://en.wikipedia.org/wiki/Ens%C5%8D >> > >> > Things I like about it: >> > (all from wikipedia, since the only thing I knew about this symbol >prior >> is >> > that someone I met had a tattoo of it). >> > It represents (among a few other things) enlightenment. >> > ^^ This resonated with the 'alternate definition of mahout' from >Hebrew- >> > which may be something akin to essence or truth. >> > >> > It is a circle- which plays to the Samsara theme. >> > >> > It is very expressive, a simple one or two brush stroke circle >which >> > symbolizes several large concepts and things about the creator, >> expressive >> > like our DSL (I feel gross comparing such a symbol to a Scala DSL, >but >> I'm >> > spit balling here, please forgive me- I am not so expressive). >> > >> > "Once the *ensō* is drawn, one does not change it. It evidences the >> > character of its creator and the context of its creation in a >brief, >> > contiguous period of time." Which reminds me of the DRMs >> > >> > In closed form it represents something akin to Plato's perfection- >which >> a >> > little more wiki surfing tells me is the idea that no one can >create a >> > perfect circle because a circle is a collection of infinite points >and >> how >> > could ever be sure that you have arranged each one properly, yet >such >> > things must exist, or what blueprint would a creator of circles be >> striving >> > for. This, by-the-by reminds me of stochastic approaches to >solving >> > problems, and really statistics / "machine-learning" in general, in >that >> we >> > can't find perfect solutions, yet we believe solutions exist and >serve as >> > our blueprint. >> > >> > Finally, I like that it is simple. >> > >> > Things I don't like about it: >> > Lucent Technologies used it back in the 90s, however they used a >very >> > specific red one, and this isn't a deal breaker for me. 
>> > >> > Other thoughts: >> > Based on the tattoo I saw- one could make an Enso using old mahout >color >> > palatte if one were to dab their brush in the appropriate colors. >This >> > could also be represented in any single color. (Not sure what that >does >> to >> > our TM, is it ok if we just keep slapping TMs on the side of it? If >that >> is >> > the case is there any reason we must have a single Enso?) >> > >> > So there is something to throw in the pot that is a little more >grown up >> > than my runner up favorites (honey badger, blueman riding bomb >waving >> > cowboy hat, blueman riding lighting bolt into a squirrel covered in >> water, >> > etc). >> > >> > Again, only know what wiki has told me, so if anyone is more >familiar >> with >> > this symbol (like was it used as a logo by some horrible dictator >which >> > carried out terrible attrocities?) or just general comments. >> > tg >> > >> > >> > >> > Trevor Grant >> > Data
RE: Welcome our GSoC Student Aditya Sarma
Hi Aditya, Welcome. Great to have you here. Isabel -- This message was sent from my Android mobile phone with K-9 Mail.
Re: Marketing
One more thing: what was really helpful in spreading the word in the early days was collecting real user stories: who achieved what with Mahout. Could be helpful for the new multi backend version as well. Imagine quotes like "we've successfully used Mahout on $insertBackendHere to solve $insertSuperDuperCoolUsecaseHere in no time" says $name, CTO of $hotNewStartup in an article about the project. Warning: this is tedious work, involves monitoring Twitter, having a Google alert for the name and talking to any number of people over long periods of time to nudge them to go public with their potentially confidential story. On March 30, 2017 at 01:03:31 CEST, Isabel Drost-Fromm <isa...@apache.org> wrote: >That is an awesome second interpretation. > >Having voted on the original name I'm 100% biased so take my opinion >with a huge grain of salt: on the one hand I think name changes are >over rated (anyone remember ethereal?), on the other hand IMHO Mahout >is a fairly strong brand representing machine learning at scale. > >Maybe a combination of any of a new logo, design, documentation, >release that drops the zero in "0.x.y", a press release for that >release that Sally can help you with, a new front page that publishes >the new focus of development, maybe a few snippets on that shift in >focus that editors can use, dropping deprecated code would already go a >long way... Just some random ideas. > >Isabel > > >On March 25, 2017 at 03:21:50 CET, Ted Dunning ><ted.dunn...@gmail.com> wrote: >>On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel <p...@occamsmachete.com> >>wrote: >> >>> maybe we should drop the name Mahout altogether. >> >> >>I have been told that there is a cool secondary interpretation of >>Mahout as >>well. >> >>I think that the Hebrew word is pronounced roughly like Mahout. >> >>מַהוּת >> >>The cool thing is that this word means "essence" or possibly "truth". >>So >>regardless of the guy riding the elephant, Mahout still has something >>to be >>said for it. 
>> >>(I have no Hebrew, btw) >>(real speakers may want to comment here)
Re: Marketing
That is an awesome second interpretation. Having voted on the original name I'm 100% biased so take my opinion with a huge grain of salt: on the one hand I think name changes are overrated (anyone remember ethereal?), on the other hand IMHO Mahout is a fairly strong brand representing machine learning at scale. Maybe a combination of any of a new logo, design, documentation, release that drops the zero in "0.x.y", a press release for that release that Sally can help you with, a new front page that publishes the new focus of development, maybe a few snippets on that shift in focus that editors can use, dropping deprecated code would already go a long way... Just some random ideas. Isabel On March 25, 2017 at 03:21:50 CET, Ted Dunning wrote: >On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel >wrote: > >> maybe we should drop the name Mahout altogether. > > >I have been told that there is a cool secondary interpretation of >Mahout as >well. > >I think that the Hebrew word is pronounced roughly like Mahout. > >מַהוּת > >The cool thing is that this word means "essence" or possibly "truth". >So >regardless of the guy riding the elephant, Mahout still has something >to be >said for it. > >(I have no Hebrew, btw) >(real speakers may want to comment here)
Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation
On Tue, Jan 31, 2017 at 04:06:36PM -0800, Dmitriy Lyubimov wrote: > Except for a several applied > off-the-shelves, Mahout has not (hopefully just yet) developed a > comprehensive set of things to use. Do you think there would be value in having that? Funding aside, would now be a good time to develop that or do you think Samsara needs more work before starting to work on that? If there's value/ good timing: Do you think it would be possible to mentor downstream users to help get this done? And a question to those still reading this list: Would you be interested and able (time-wise) to help out here? > The off-the-shelves currently are cross-occurrence recommendations (which > still require real time serving component taken from elsewhere), svd-pca, > some algebra, and Naive/complement Bayes at scale. > > Most of the bigger companies i worked for never deal with completely the > off-the-shelf open source solutions. It always requires more understanding > of their problem. (E.g., much as COO recommender is wonderful, i don't > think Netflix would entertain taking Mahout's COO run on it verbatim). Makes total sense to me. Would it be possible to build a base system that performs ok and can be extended such that it performs fantastically with a bit of extra secret sauce? > It is quite common that companies invest in their own specific > understanding of their problem and requirements and a specific solution to > their problem through iterative experimentation with different > methodologies, most of which are either new-ish enough or proprietary > enough that public solution does not exist. While that does make a lot of sense, what I'm asking myself over and over is this: Back when I was more active on this list there was a pattern in the questions being asked. Often people were looking for recommenders, fraud detection, event detection. Is there still such a pattern? 
If so it would be interesting to think which of those problems are wide spread enough that offering a standard package integrated from data ingestion to prediction would make sense. > That latter case was pretty much motivation for Samsara. If you are a > practitioner solving numerical problems thru experimentation cycle, Mahout > is much more useful than any of the off-the-shelf collections. +1 This is also why I think focussing on Samsara and focussing on making that stable and scalable makes a lot of sense. The reason why I dug out this old thread comes from a slightly different angle: We seem to have a solid base. But it's only really useful for a limited set of experts. It will be hard to draw new contributors and committers from that set of users (it will IMHO even be hard to find many users who are that skilled). What I'm asking myself is if we should and can do something to make Mahout useful for those who don't have that background. > > perspective? If so, would there be interest among the Mahout committers to > > help > > users publicly create docs/examples/modules to support these use cases? > > > > yes Where do we start? ;) Isabel
Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation
Hi, On Fri, Sep 16, 2016 at 11:36:03PM -0700, Andrew Musselman wrote: > and we're thinking about just how many pre-built algorithms we > should include in the library versus working on performance behind the > scenes. To pick this question up: I've been watching Mahout from a distance for quite some time. So from what limited background I have of Samsara I really like its approach of being able to run on more than one execution engine. To give some advice to downstream users in the field - what would be your advice for people tasked with concrete use cases (stuff like fraud detection, anomaly detection, learning search ranking functions, building a recommender system)? Is that something that can still be done with Mahout? What would it take to get from raw data to finished system? Is there something we can do to help users get that accomplished? Is there even interest from users in such a use case based perspective? If so, would there be interest among the Mahout committers to help users publicly create docs/examples/modules to support these use cases? Isabel
Berlin Buzzwords 2014: CfP is open
I'm super happy to announce that the call for submissions for Berlin Buzzwords 2014 is open. For those who don't know the conference - in my absolutely objective opinion the event is the most exciting conference on storing, processing and searching large amounts of digital data for engineers. The 5th edition of Berlin Buzzwords will take place on May 25-28, 2014 at Kulturbrauerei Berlin. Berlin Buzzwords is looking for speakers who submit talks on the following topics: * Information Retrieval / Search e.g. Lucene, Solr, katta, ElasticSearch or comparable solutions * NoSQL and SQL e.g. CouchDB, MongoDB, Jackrabbit, Hbase and others * Large Data Processing e.g. Hadoop itself, MapReduce, Cascading, Pig, Spark and friends Closely related topics not explicitly listed above are welcome as well. The Call for Submissions will be open until February 9! Be part of Berlin Buzzwords and submit your session idea. Please register here: http://berlinbuzzwords.de/call-submissions. Looking forward to lots of interesting proposals - and looking forward to meeting all of you in Berlin later this year (did I mention that Berlin rocks in summer?) Isabel PS: As always, any help with spreading the word is highly welcome. PS2: One final hint - even though speakers of course get a complimentary conference pass make sure to still check out our ticket page in particular if you'd like to bring your children to the conference - we do provide child day care on a donation basis but need your registration for capacity planning: http://berlinbuzzwords.de/tickets
Re: java.lang.NoClassDefFoundError: com/google/common/base/Preconditions
On Thu, 28 Nov 2013 13:24:26 +0530 Tharindu Rusira tharindurus...@gmail.com wrote: Yes that's the exact issue Suneel, it was a careless mistake while adding projects to Eclipse that I missed those .jars. When changing Mahout code make sure to either run mvn eclipse:eclipse before importing the project into your workspace or enable maven support in Eclipse. When integrating Mahout into your project it's best to use Maven, Ivy, Gradle or some other build system that supports resolving transitive dependencies automatically to avoid these issues. Isabel
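A quick way to confirm this kind of classpath problem, before digging into the IDE setup, is to ask the class loader directly. The sketch below is plain Java with no Mahout dependency. The class name `ClasspathCheck` and the method `isOnClasspath` are made up for illustration; the Guava class name is taken from the error in the subject line.

```java
// Minimal sketch (hypothetical helper, not part of Mahout):
// check whether a class can be located before running a job.
final class ClasspathCheck {

    /** Returns true if the named class is visible to this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError raised while resolving the class
            return false;
        }
    }

    public static void main(String[] args) {
        String guava = "com.google.common.base.Preconditions";
        System.out.println(guava + " on classpath: " + isOnClasspath(guava));
    }
}
```

If this reports false from inside the IDE but true when the same code is launched through the build tool, the project import is a more likely culprit than the code itself.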
Re: Mahout fpg
On Fri, 22 Nov 2013 17:55:13 +0800 Jason Lee wua...@gmail.com wrote: I noticed lots of algorithms implementations has deprecated in Mahout 0.8 and removed in 0.9, but no reasons or comments been marked. Can i ask why? As Suneel mentioned earlier: Before removing these algorithms we asked on the user list for input on what users really needed. If you need anything that was marked deprecated you are welcome to step up, provide patches and improvements to revive implementations that are currently in danger of being deleted soon. Btw, Mahout API is a little lack javadoc comments, every contributors of Mahout should has the responsibility to add more javadoc comments to the java file they created. Not an excuse but maybe a step forward: If you find classes and packages lacking documentation that you know well (or are in the process of getting to know well) we'd be grateful if you could provide the missing documentation as a patch to the code base*. Isabel * Also in my experience documentation patches tend to be easier to get approval for from your employer than donating whole new implementations that you have developed internally...
Re: Could OpenNLP use Mahout for classification?
Hi Jörn, On Tuesday, April 09, 2013 10:12:47 PM Jörn Kottmann wrote: Logistic Regression (is that similar to our maxent ?) Online Passive Aggressive HMM The datasets we are training OpenNLP are usually rather small and can easily be processed with a single CPU, does Mahout support training on small scale datasets as well? In particular the Logistic Regression and HMM stuff should be well suited even for smaller data sets. You can find the JavaDoc for each here: https://builds.apache.org//job/Mahout-Quality/javadoc/org/apache/mahout/classifier/sgd/package-summary.html and here: https://builds.apache.org//job/Mahout-Quality/javadoc/org/apache/mahout/classifier/sequencelearning/hmm/package-summary.html Both have versions that can run standalone on a single box - they may come with Hadoop as a dependency, mainly for serializing vectors and matrices to disk but not for computation distribution. Isabel
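To make "runs standalone on a single box" concrete, here is a minimal sketch of logistic regression trained by stochastic gradient descent entirely in memory. It is plain Java written for illustration and does not use the classes in Mahout's `org.apache.mahout.classifier.sgd` package; the class name `TinyLogisticRegression`, the learning rate and the toy data are all assumptions.

```java
// Minimal sketch (not Mahout's implementation): binary logistic regression
// trained one example at a time with stochastic gradient descent.
final class TinyLogisticRegression {
    private final double[] weights; // last entry is the bias term
    private final double learningRate;

    TinyLogisticRegression(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures + 1];
        this.learningRate = learningRate;
    }

    /** Estimated probability that the example belongs to class 1. */
    double predict(double[] x) {
        double z = weights[weights.length - 1];
        for (int i = 0; i < x.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One gradient step on a single labelled example (label is 0 or 1). */
    void train(double[] x, int label) {
        double error = label - predict(x);
        for (int i = 0; i < x.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
        weights[weights.length - 1] += learningRate * error;
    }

    public static void main(String[] args) {
        // Toy data: small x values are class 0, large x values are class 1
        double[][] xs = {{0.0}, {1.0}, {2.0}, {3.0}};
        int[] ys = {0, 0, 1, 1};
        TinyLogisticRegression model = new TinyLogisticRegression(1, 0.5);
        for (int epoch = 0; epoch < 2000; epoch++) {
            for (int i = 0; i < xs.length; i++) {
                model.train(xs[i], ys[i]);
            }
        }
        System.out.println("p(class 1 | x=0) = " + model.predict(new double[] {0.0}));
        System.out.println("p(class 1 | x=3) = " + model.predict(new double[] {3.0}));
    }
}
```

Because each update touches one example at a time, nothing here needs a cluster, which is the property that makes this family of learners a reasonable fit for small data sets.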
[OT] Any Mahout ppl interested in meeting for a drink or two in Sydney on Friday?
Hi, on Friday, Sep 28th I'll be meeting with a few other Apache ppl in Sydney for dinner and a drink or two. Let me know if you are interested in joining us. Isabel
Re: NetflixRecommender Data
On 01.06.2012 Sean Owen wrote: It is no longer officially available because of the lawsuit against Netflix. Hmm - when thinking of our tidying up work: Should we then remove the example? Isabel signature.asc Description: This is a digitally signed message part.
Re: Mahout + BigDataR Linux
On Fri, Jun 1, 2012 at 7:40 AM, Nicholas Kolegraff nickkolegr...@gmail.com wrote: I'm on board with this. This has been a common suggestion from more advanced users (and makes sense). I am exploring how to incorporate packages into the build process, I don't want to commit to anything, yet, but plan to take a much deeper dive mid July. Some information that might help you: The Debian new maintainers guide: http://www.debian.org/doc/manuals/maint-guide/index.en.html The Debian wiki on how to package Java projects including information on how to package maven-built software: http://wiki.debian.org/Java/Packaging There also is a mailing list for more discussion on problems and questions related to packaging Java projects into Debian: http://lists.debian.org/debian-java/ One word of warning: You might run into one issue or another as Java projects usually aren't built in a way that's particularly amenable to turning them into distribution packages right away. However it should help that Mahout is maven built and relies on standard libraries only. Cheers, Isabel
Re: Mahout + BigDataR Linux
On 03.05.2012 Ted Dunning wrote: As a point of strategy, wouldn't have better to just build a debian package repository and a script for installing packages? Or go even one step further and provide real Debian packages? Isabel
Berlin Buzzwords program is online
This is to announce the Berlin Buzzwords program. The Program Committee has completed reviewing all submissions and set up the schedule containing a great lineup of speakers for this year's Berlin Buzzwords program. Among the speakers we have Leslie Hawthorn (Red Hat), Alex Lloyd (Google), Michael Busch (Twitter) as well as Nicolas Spiegelberg (Facebook). Check out our program at http://berlinbuzzwords.de/program/session-schedule Berlin Buzzwords standard conference tickets are still available. Note that we also offer a special rate for groups of 5 and more attendees with a 15% discount off the standard ticket price. “Berlin Buzzwords is by far one of the best conferences around if you care about search, distributed systems, and NoSQL...” says Shay Banon, founder of ElasticSearch. Berlin Buzzwords will take place June 4th and 5th 2012 at Urania Berlin (http://www.uraniaberlin.de). The 3rd edition of the conference for developers and users of open source projects again focuses on everything related to scalable search, data-analysis in the cloud and NoSQL-databases. We are bringing together developers, scientists, and analysts working on innovative technologies for storing, analysing and searching today's massive amounts of digital data. Berlin Buzzwords is organised by newthinking communications GmbH in collaboration with Isabel Drost (Member of the Apache Software Foundation, PMC member Apache community development and co-founder of Apache Mahout), Jan Lehnardt (PMC member Apache CouchDB) and Simon Willnauer (PMC member Apache Lucene). More information including speaker interviews, ticket sales, press information as well as meet me at bbuzz buttons are available on the official website: http://berlinbuzzwords.de/ Looking forward to meeting you in June, Isabel PS: Did I mention that Berlin is all beautiful in Summer?
Re: Slides for Talk on BedCon
On 31.03.2012 Manuel Blechschmidt wrote: you can find my slides for my presentation based on Mahout and Java EE for the Berlin Expert Days 2012 here: https://github.com/ManuelB/facebook-recommender-demo/raw/master/docs/Talk-BedCon-Berlin-2012.pdf Thanks for sharing - I added them here: https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks Isabel
Re: Mahout at Twitter
On 05.04.2012 Jake Mannix wrote: changing subject line to split off from Sean's Myrrix discussion On Thu, Apr 5, 2012 at 1:28 AM, Dan Brickley dan...@danbri.org wrote: On 5 April 2012 00:18, Jake Mannix jake.man...@gmail.com wrote: +1 to everything Ted said. As an added point, while we're on the subject of corporate involvement, forks, and extensions of Mahout, now is as good a time as any to announce that I (and my teammate Andy Schlaikjer) are maintaining a official Twitter fork of Mahout (hosted and worked on entirely in the open on GitHub: http://github.com/twitter/mahout ), which we'll be making patches off of to submit back to Apache trunk on a periodic basis. [...] Jake, care to add some appropriate brief mention to https://cwiki.apache.org/MAHOUT/powered-by-mahout.html ? Knowing that Twitter make serious use of Mahout adds a lot of credibility to the project, and I'm sure would be enough additional information to tip various others over into more seriously considering adoption. I thought about linking your previous email mention of it, ... but you'd know better what to say and/or link to. done! Awesome - Thanks! It's always great to see more entries to this list. Isabel
Re: Commercializing Mahout: the Myrrix recommender platform
On 05.04.2012 Sean Owen wrote: On Wed, Apr 4, 2012 at 11:43 PM, Darren Govoni dar...@ontrenet.com wrote: The short answer is that they have to open their source. So anything they do to the original code is readily available to all. Not with the Apache license... it's not copyleft. The GNU license might require this. AFAIK and IANAL: Neither the GNU General Public License nor the GNU Lesser General Public License requires modifiers to open their source to just anyone. Only if they hand the resulting binary over to someone else do those modifications need to be given to said person and made available under the same original license, in an effort to give the receiver of your binary the same rights that you initially built your works on. There is no need to make these modifications available to the general public - although that might turn out to be the most pragmatic solution. Isabel
Re: Error Running mahout-core-0.5-job.jar
On 22.03.2012 Paritosh Ranjan wrote: You can also use HadoopUtil.delete(conf, paths) api or use the -ow (override) flag ( if available for that job). If that flag isn't available for the job you are looking at, that might be a good chance to submit a bug report that is suitable for beginners - just mark it as MAHOUT_INTRO_CONTRIBUTE in JIRA. Isabel
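Both suggestions come down to clearing the previous output before a job writes to the same location again. The sketch below shows that idea on the local filesystem with `java.nio.file`. It is only an analogue: it does not call Hadoop or Mahout, and the names `OutputCleaner` and `deleteIfPresent` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal sketch (local filesystem only, not the Hadoop API):
// remove a stale output directory so a rerun can write to it again.
final class OutputCleaner {

    /** Recursively deletes dir if it exists; returns true if something was removed. */
    static boolean deleteIfPresent(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");
        Files.createFile(out.resolve("part-00000"));
        System.out.println("removed: " + deleteIfPresent(out));
    }
}
```

Returning a boolean rather than failing on a missing directory keeps the call safe to run before every job, whether or not an earlier run left output behind.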
Re: How to add classes into mahout-score-0.5-job.jar?
On 22.03.2012 jeanbabyxu wrote: From Chapter 6 of Mahout in Action (page 111) But were you to use your own implementation, you would need to add it and any of its dependent classes into the JAR file as well. This can be accomplished with jar uf mahout-core-0.5-job.jar -C [classes directory] My question is : how to find out the directory for the dependent classes? This description explains how to add your own classes that you have implemented to the classpath - e.g. in cases where you want to use your own distance implementation rather than those provided with Mahout. I'm not sure this is what you are looking for. What do you want to accomplish? Isabel
Re: Mahout 0.6 Naive Bayes Accuracy
On 27.03.2012 Dimitri Goldin wrote: Having tried Mallets naive bayes implementation we achieved ~95% accuracy without having to balance the training-data. Does anybody know which implementation detail might cause this or why balance seems influence mahouts implementation much more? Without knowing the Mallet implementation: You describe that you tried using two tokenizations for your Mahout runs - what are you using when running Mallet? Which Naive Bayes implementation in Mahout did you use? Did you also try running with the complementary naive bayes implementation or the logistic regression instead? Isabel
Re: options for finding smallest eigenvectors
On 28.03.2012 Dmitriy Lyubimov wrote: Nathan Halko's thesis did detailed comparisons of singular values between Mahout's Lanczos and SSVD. You can look up a link to his dissertation on this list archive. (or perhaps he mentioned it @dev, can't remember on top of my head). When you find it - could you please add it here: https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks Maybe add a separate section for scientific publications involving Mahout. Isabel
Re: BedCon Talk: Easy Java EE 6 war with Mahout bundled 0.6
On 20.02.2012 Manuel Blechschmidt wrote: I am going to give a talk about setting up mahout in a Java EE environment: http://bed-con.org/talks/how-to-build-a-recommender-system-based-on-mda-gwt-mahout-and-java-ee/ I created in my eyes the smallest possible demo for a recommender with an as easy as possible set up including a small sample of my facebook friends. Sounds like an interesting talk - would you mind sharing the slides afterwards? Isabel
Re: dataset for recommendations and Hidden markov chains
On 06.01.2012 rahul raghavendhra wrote: hi all, Can u suggest me the dataset for hidden markov chains and recommendations.. Please check out our wiki (the link on mahout.apache.org is called documentation): there is a list of datasets as well as several pages on how to start and run the various algorithms in Mahout. Isabel
Re: Mahout Installation without Building From Maven
Just two minor comments: On 28.12.2011 Lance Norskog wrote: In general you are better off using the full source distribution. Binaries are provided for convenience only - however you should be fine if you only want to use the jars and job.jars. There are some apps and scripts that can help you and these are not packaged into the Maven binary distribution. Also, 0.5 is an old release and there have been a lot of changes since then. In case you have the cycles to try out trunk - any feedback is highly welcome as we are in the process of getting a new version released soon. Isabel
Re: Generating vectors from custom source
On 16.12.2011 Dale McDiarmid wrote: Could someone please share some code from a similar requirement to get me started - an example reading a csv file for example. There is also some documentation on the wiki: https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html https://cwiki.apache.org/MAHOUT/lda-commandline.html Isabel
Re: Problems reading solr index
On 19.12.2011 Billy Newport wrote: Any place to download the current snapshot? I'm firewalled here so I can't get Svn access. Just for reference: https://repository.apache.org/ should have recent snapshots. Isabel
Re: ANN: The Mahout Recommender Plugin 0.5.1 Released
On 20.12.2011 Chee Kin Lim wrote: Please see release note at http://limcheekin.blogspot.com/2011/12/mahout-recommender-plugin-051-released.html Thanks for integrating the recommender part of Mahout into Grails. Any feedback on how to better support integration into 3rd party components is highly welcome. On a related note: Do you think it might also make sense to integrate any other parts of Mahout? Isabel
Re: Weighted Naive Bayes Algorithm
On 20.12.2011 Ramprakash Ramamoorthy wrote: I am using Naive Bayes classifier for my sentiment analysis on customer support. But unfortunately I don't have huge annotated data sets in the customer support domain. If your training set is small - why not use e.g. SGD instead of Naive Bayes? Isabel
Re: Map/Reduce for mahout SGD Classification
On 21.12.2011 Ted Dunning wrote: On Tue, Dec 20, 2011 at 11:06 PM, selva selvai...@gmail.com wrote: When will map/reduce release for mahout SGD Classification? Probably 0.6 When will mahout 0.6 release ? Q1 of 2012 Valid for both: If you need the functionality faster - Any helping hand even if it just involves testing the patch is welcome. Isabel
Re: Austin SIGKDD - Next Meeting Wednesday, December 14, 2011, 7:00 - 8:00 pm
On 14.12.2011 David Boney wrote: Sure, we are studying machine learning using Mahout. We have started a weekly hackers dojo to learn how to implement Hadoop based machine learning programs using Mahout. Once the group get some experience using Mahout, we are going to focus on projects to add functionality to Mahout. While you are still in the new-user-trying-to-figure-stuff-out mode - it would be great if you could point out any documentation that lacks detail - or maybe even fix it. Also, instead of only adding new functionality, it would be great if you could also concentrate on better integration and streamlining - I guess you are looking at various parts of Mahout right now. Isabel
Re: newbie design question
On 08.12.2011 ajinkya wrote: I am struggling in the mountain of tutorials and documentations... need some design help. There are two wiki pages that should help you get started: https://cwiki.apache.org/confluence/display/MAHOUT/Quickstart (the chapter on clustering has some examples) https://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData (gives more detailed instructions) Isabel
Re: Frequent itemset mining
On 02.12.2011 Tom Pierce wrote: These programs are actually exposed through the main mahout program; if you run: $MAHOUT_HOME/bin/mahout fpg it will run the Frequent Pattern Growth algorithm (aka frequent itemset mining). Also there is quite some documentation on the wiki: https://cwiki.apache.org/MAHOUT/parallel-frequent-pattern-mining.html (also includes a link to the original research publication). Isabel
Re: LDA clustering example not working
On 02.12.2011 Chris Grier wrote: Caused by: java.io.IOException: Cannot open filename /tmp/mahout-work-hadoop/reuters-out-seqdir-sparse-lda/tf-vectors/_logs Are you providing the correct input directory here? On first sight it seems to think that the logs dir contains the tf-vectors. On a related note: If you are working with LDA - did you try out Jake's new implementation? Would be great to get more feedback on that one. Isabel signature.asc Description: This is a digitally signed message part.
Re: Grant's developerworks article a weekly highlight
On 02.12.2011 Ted Dunning wrote: We knew it was a highlight, but IBM seems to agree! http://www.ibm.com/developerworks/podcast/twodw-110911/ Grant: Congratulations! Isabel
Re: Relevance score - Classification
On 29.11.2011 Faizan(Aroha) wrote: In our case, I think we won't be looking much into features. I am moving towards clustering as Tantons's mentioned. Hmm - what kind of similarity measure are you planning to use for that? What makes two items similar in your use case? Isabel
Re: LDATopic
On 28.11.2011 bish maten wrote: mahout ldatopics -i mahout-work/abc/abc-lda/state-20 -d mahout-work/abc/abc-out-seqdir-sparse-lda/dictionary.file-0 -dt sequencefile (there were no errors reported and command worked fine with following output). Does the output appear ok? Hmm - this only prints the resulting LDA topics - which command did you use to generate them? Please also note that Jake is currently working on improving our LDA support, if you are interested in that algorithm it might be interesting for you to look into his patch in https://issues.apache.org/jira/browse/MAHOUT-897 Isabel signature.asc Description: This is a digitally signed message part.
Re: Mahout distribution download
On 28.11.2011 Sean Owen wrote: There is no newer distribution, but, you can always check out the very latest from Subversion: https://cwiki.apache.org/confluence/display/MAHOUT/Version+Control Also we do publish nightly builds at the Apache Maven-Snapshot repository. If you would like to help shorten the time it takes until the next release, please check any open issues tagged as 0.6: https://issues.apache.org/jira/browse/MAHOUT/fixforversion/12316364 Isabel
Re: Data class taxonomy for machine learning
On 29.11.2011 Ted Dunning wrote: I find this taxonomy excessive and over-done. The distinctions I find useful include
- continuous variables
- discrete variables with a known set of values (I call these categorical, usually). This includes ordinal variables since ordering rarely makes a lot of difference.
- discrete variables with a large or not well known set of possible values (I call these word-like)
- bags or lists of word-like variables (I call these text-like)
What I found useful for explaining which data types to expect: http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf (Slide 6, unfortunately in German only) What seemed more needed was an explanation of different problem settings and how to tackle them on a very high level: http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Problemanalyse.pdf Isabel
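To make the four kinds of variables in the quoted list a bit more concrete, here is a minimal sketch of how one record might be turned into a single numeric vector. It is plain Python and does not use any Mahout API; the field names (age, color, user_agent, comment), the category list and the hashed vector size are made up for illustration only.

```python
import zlib

def bucket(token, size):
    # Stable hash of a token into one of `size` slots (the "hashing trick").
    return zlib.crc32(token.encode("utf-8")) % size

def encode(record, categories, hashed_size=8):
    # Continuous variable: used as-is.
    vector = [float(record["age"])]
    # Categorical variable (known set of values): one slot per known value.
    vector += [1.0 if record["color"] == c else 0.0 for c in categories]
    # Word-like and text-like variables share a fixed-size hashed section.
    hashed = [0.0] * hashed_size
    hashed[bucket(record["user_agent"], hashed_size)] += 1.0  # word-like
    for word in record["comment"].split():                    # text-like
        hashed[bucket(word, hashed_size)] += 1.0
    return vector + hashed

# Hypothetical example record
record = {
    "age": 42,
    "color": "red",
    "user_agent": "Mozilla/5.0",
    "comment": "fast shipping great price",
}
categories = ["red", "green", "blue"]
vector = encode(record, categories)
print(vector)
```

Hashing keeps the vector size fixed even when the set of possible values is large or not known in advance, at the price of occasional collisions.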
Re: Clustering - Sequence File from Directory
On 30.11.2011 Faizan(Aroha) wrote: Would anyone please give any hint? On Running the following command: bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles I'm getting the following error: MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath. MAHOUT_LOCAL is set, running locally That means the job will run locally only, don't expect any jobs to appear in your Hadoop jobtracker. Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver Did you build all of mahout before running the command line tool? Isabel signature.asc Description: This is a digitally signed message part.
Re: org.apache.maven.plugins:maven-antrun-plugin:1.6:run grief, copy-dependencies and unpack goals not supported by m2e, importing mahout into Eclipse
On 27.11.2011 Mike Spreitzer wrote: I presume you are speaking to the issue of the goals not supported. Note that I also have another problem, No marketplace entries found to handle maven-antrun-plugin:1.6:run in Eclipse. Any clues about that one? To me that sounds like an Eclipse problem (or a problem with the maven integration of Eclipse). Isabel signature.asc Description: This is a digitally signed message part.
Re: Scalable graph clustering implementation
First of all, your findings sound very interesting - thanks for sharing. On 26.11.2011 Bae, Jae Hyeon wrote: I want to contribute my implementation to Mahout if it is available and allowed. Please let me know how I can follow up. The easiest starting point would be our how to contribute wiki page: https://cwiki.apache.org/MAHOUT/how-to-contribute.html Please keep in mind that contributing whole new packages and algorithms may be a bit more involved: Apart from the implementation itself including unit tests there also is a need for a running example that shows how to use your code and at least some sort of quickstart documentation in our wiki. Isabel signature.asc Description: This is a digitally signed message part.
Re: Sequential Pattern Mining
On 27.11.2011 Nishant Chandra wrote: I want to identify rules such as: after acquiring product 1 and then product 3, customers have an increased likelihood (75%) of purchasing product 4 next. What is your goal with discovering these rules? Assuming what you want is implementing a feature that recommends items to customers they are likely to buy: Did you check the fpgrowth implementation already? Though it does not cover the temporal aspect you mention, it might still be of value for you as it is capable of discovering items that are typically purchased together. If you would rather personalize your offerings to the preferences of each of your customers, you might be better off taking a closer look at the collaborative filtering implementations of Mahout. Isabel
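As a rough illustration of what "items that are typically purchased together" means, the following sketch counts how often pairs of items share a basket and keeps the pairs above a minimum support. This is a naive pair count in plain Python, not Mahout's fpgrowth implementation, and it ignores the order of purchases; the basket data and the threshold are made up.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives every pair a canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def frequent_pairs(baskets, min_support):
    """Keep only the pairs that occur in at least `min_support` baskets."""
    counts = co_purchase_counts(baskets)
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical purchase histories, one list per customer basket
baskets = [
    ["product1", "product3", "product4"],
    ["product1", "product3"],
    ["product3", "product4"],
    ["product1", "product3", "product4"],
]
result = frequent_pairs(baskets, min_support=3)
print(result)
```

Frequent pattern mining generalises this idea from pairs to larger item sets without enumerating every combination.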
Re: ItemSimilarity example
On 27.11.2011 bish maten wrote: https://cwiki.apache.org/MAHOUT/recommender-documentation.html has following example // Construct the list of pre-computed correlations Collection<GenericItemSimilarity.ItemItemSimilarity> correlations = ...; how is actual construction done in above line ( correlations = ... ; ) If I understand the abstract above that line of code correctly, these item similarities should capture your own domain knowledge about items. So how to compute them is up to your definition of what makes two items similar. Isabel
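One possible way to pre-compute such similarities from domain knowledge is sketched below, assuming each item is described by a set of tags and that "similar" means "shares many tags" (Jaccard overlap). Both the tag data and the choice of Jaccard are assumptions for illustration; the sketch is plain Python, not Mahout code. Each resulting (item1, item2, value) triple would correspond to one entry of the correlations collection from the wiki example.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of tags two items have in common, between 0.0 and 1.0."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def item_item_similarities(item_tags):
    """Return (item_id_1, item_id_2, similarity) for every pair with overlap."""
    triples = []
    for (id1, tags1), (id2, tags2) in combinations(sorted(item_tags.items()), 2):
        value = jaccard(tags1, tags2)
        if value > 0.0:
            triples.append((id1, id2, value))
    return triples

# Hypothetical item descriptions keyed by item ID
item_tags = {
    1: {"rock", "guitar"},
    2: {"rock", "drums"},
    3: {"jazz", "piano"},
}
correlations = item_item_similarities(item_tags)
print(correlations)
```

Any other definition of similarity (same author, same category, editorial judgement) fits the same shape, as long as it yields one value per item pair.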
Re: mahout command problems
On 27.11.2011 bish maten wrote: mvn compile done under subdirectory of mahout-distribution. Did you also run a mvn package from the mahout root directory? Isabel
Re: Load Dataset and Instances from database
On 24.11.2011 Ted Dunning wrote: Actually, one of the most reliable ways to kill a database is to use it as input or output for even a small Hadoop cluster. Having hundreds of processes all open connections and read at once is fairly abusive. Though that does not mean that data cannot be synced to hdfs before being used in a map/reduce job. Tools like sqoop help with that. Isabel
Re: Load Dataset and Instances from database
On 24.11.2011 Sturm, Martin wrote: Since I only want to try it out standalone I was hoping that this was possible without any Hadoop stuff. Are there any tutorials or examples available that show how to load a Dataset? Because I do not even know what files are expected here.. cvs? You may want to take a look at our quickstart wiki page for that. It explains the two examples that show how decision forests can be used: https://cwiki.apache.org/MAHOUT/breiman-example.html https://cwiki.apache.org/MAHOUT/partial-implementation.html Isabel
Re: Relevance score - Classification
On 23.11.2011 Faizan(Aroha) wrote: We are working on using Classification as a Search. I want to compute the relevance score of the output which is generated by the Naive Bayes Classifier or some other classifier. Please give any guideline/hint! Can you please provide some more background to your use case? Which documents do you want to search? How is relevance defined in your setting? Isabel signature.asc Description: This is a digitally signed message part.
Re: Error in executing mahout kmeans
On 22.11.2011 DIPESH KUMAR SINGH wrote: I ran the script and i was getting error regarding missing libraries. The error which i got is attached. Then i tried executing the commands in the script, command by command, and i figured out that error was coming in the seq2sparse step. (Prior to this step all the conversions are working fine) There seem to be problems resolving some of the dependencies used - not sure why though. You did compile the project and in that process created a job jar? What i exactly want to try is document clustering, i thought it is better to try first with Reuters dataset to get started. Are the source files of kmeans (mapper and reducer etc) are there in mahout source folder? Sure, look in the maven module core in the o.a.m.clustering package - all kmeans related code is in there. Isabel signature.asc Description: This is a digitally signed message part.
Re: Which input formats to use for classifying WEKA's ARFF format?
On 22.11.2011 HorstItUpright wrote:
> As far as I know, Mahout provides two Bayes algorithms and a Random Forest (which is - whyever - called Decision Forest [which is not wrong, I know, but confusing and inconsistent to the Docs I think]).

+ logistic regression (to be found in the sgd package)

> It appears to me (and I've also taken a look into the code) that none of these approaches can handle the MVC format (which is the result, when parsing the WEKA-ARFF files with the arff-vector converter).

I am not too familiar with the MVC format - is that an intermediate file format used by WEKA after parsing ARFF?

> The DF is even more special and requires the UCI format.

DF?

> My question now is: am I overseeing something? Is there a way to convert the MVC files on the fly into the proper formats for the algorithms?

All algorithms in Mahout are implemented to accept vectors as input format. So in order to plug in whatever input format (or database, NoSQL store, whichever other source for data you might have) all you have to do is provide glue code that converts your data into Mahout vectors. Having said that, there is limited support for ARFF in Mahout already. To my knowledge that is not feature complete - any help with spotting missing features and fixing them is highly welcome.

> The Bayes algorithms e.g. are running with the input data, but print a lot of strange output to the console during processing and do not give any usable results.

Any help with improving logging to make the project easier to use is very welcome. Would be great if you could put up a JIRA issue and attach a patch that changes the code to better match your expectations, to get that discussion started.

Cheers, Isabel
Re: Wiki edit request
On 19.11.2011 Lance Norskog wrote: Fixed. A: it moved, and B: it's Jenkins now. On Fri, Nov 18, 2011 at 6:02 PM, Dan Beaulieu danjacob.beaul...@gmail.com wrote: While on the topic, the hudson url is broken... Don't know what it should be... Dan - good catch. Lance, thanks for the fix. Isabel
Re: New User to Mahout
On 12.11.2011 thinkingbigdata wrote: I want to understand it fully and want coding to be done in Java. If anyone can help me with some examples code that is using Hadoop written examples that would be really helpful. Do you have any machine learning problem you want to get started with in particular? Knowing what in particular you are interested in would make it easier to answer your question. Isabel signature.asc Description: This is a digitally signed message part.
Re: Coding format update: Eclipse Lucene conventions
On 14.11.2011 Lance Norskog wrote: The Eclipse Lucene conventions are mighty close to what we're using, much more so than the Eclipse formatting file on the How To Contribute page. So, I've uploaded the Lucene file and changed the link. Eclipse users, please try it and see if it's what we want. Thanks for that contribution. Isabel
Re: mahout for enterprise search project
On 15.11.2011 Burcu Buyukkagnici wrote: Where does mahout; Lucene/solr and UIMA framework fit in the following scenario? For some more background on how search and machine learning fit together, see also http://www.manning.com/ingersoll/ Also, at the latest ApacheCon NA Grant provided some ideas and insights on what types of problems can be solved by a search engine alone. Recordings of all talks are online at http://feathercast.org Isabel
Re: Documentation
On 16.11.2011 Ted Dunning wrote: One thing that you can do is to point out the problems and even suggest or provide some improvements. Your eyes are still new and thus will see problems more clearly than ours. One thing to note: Most of the Mahout documentation is online in our wiki - that wiki essentially is public, so if you do have some time left and spot an area that you think needs improvement, please do not hesitate to add information. Also if you spot missing JavaDocs: Providing them is a very simple way to get your first patches in. Isabel signature.asc Description: This is a digitally signed message part.
Re: Austin Hacker Dojo - Big Data Machine Learning
On 17.11.2011 David Boney wrote: If at least three or four people are interested we can have an organization meeting to discuss the group name, finding a location to meet, development environment, setting up a web site, and the agenda for the first couple of months. Just a brief comment: I don't know how much interest in Big Data Machine Learning there is in Austin - however, what worked for most meetings I started in Berlin was to have a more informal gathering at first to figure out how many people would be interested - and later on decide on web site, agenda etc. Isabel
Re: Large Scale Clustering
On 18.11.2011 Grant Ingersoll wrote: Might be of interest: Clustering Very Large Multi-dimensional Datasets with MapReduce http://www.cs.cmu.edu/~jclopez/ref/kdd2011-mr-clustering.pdf Judging from the abstract it looks interesting indeed. Thanks for sharing, Grant. Isabel
Re: Relevance Prediction Challenge / WSDM 2012 Web Search Click Data Workshop
On 07.11.2011 Pavel Serdyukov wrote: We are pleased to announce the launch of the Relevance Prediction Challenge, which is a part of the WSDM 2012 Web Search Click Data (WSCD) workshop. This challenge provides a unique opportunity to consolidate and scrutinize the work from industrial labs on predicting the relevance of URLs using user search behavior. It provides a fully anonymized dataset shared by Yandex, which has user queries, clicks on URLs and their relevance labels. Any of our Mahout users interested in taking up that challenge? Might be a nice project also for people in the academic world working on relevance models based on user feedback. Isabel signature.asc Description: This is a digitally signed message part.
Re: does anyone use the row label bindings stuff in Vector / Matrix?
On 02.11.2011 Jake Mannix wrote: I'll leave this thread open until after work tonight (8 hrs or so from now), and if I don't hear any vociferous complaints or reasoned thoughts on why this is crazy, I'll chop 'em. +1 for the cleanup, however if you are leaving the thread open for that purpose, you might want to at least wait a day until people in all time zones had a chance to read it. Isabel signature.asc Description: This is a digitally signed message part.
Re: Production use cases of Mahout
On 01.11.2011 Josh Patterson wrote: There's a few, check out: http://www.hadoopworld.com/agenda/ The bit.ly folks always have something interesting to show. The WibiData guys are doing some interesting things with their product and recommendation. Any chance that slides/videos of the talks are going to be made public after the event? Would love to link to them from the Mahout wiki. Isabel signature.asc Description: This is a digitally signed message part.
Re: Exception in thread main org.apache.lucene.index.CorruptIndexException: unrecognized format -3 in file _b.fnm
On 20.10.2011 OldSkoolMark wrote: Exception in thread “main” org.apache.lucene.index.CorruptIndexException: unrecognized format -3 in file “_b.fnm” I don't have much experience with Lucene, but this looks like you are trying to read the index with a Lucene version that is older than the one the index was created with? Isabel
Re: Looking for someone with experience integrating content-based approaches in Mahout
On 20.10.2011 Ted Dunning wrote: There is also the j...@apache.org mailing list which is less focussed but might hit some folks with the right expertise that this list does not. And there is https://cwiki.apache.org/MAHOUT/professional-support.html which lists companies and people that declared themselves as willing and capable of helping Mahout users. Isabel
Re: Request for Assistance.
First of all, welcome also from my side.

On 05.10.2011 Apurv Verma wrote:
> I am interested in becoming a contributor to Mahout.

Actually we have a How to contribute page on our wiki that might help you: https://cwiki.apache.org/MAHOUT/how-to-contribute.html I guess the general takeaway is to start using Mahout for your own projects. As with any software you use, sooner or later you will find stuff that bothers you: missing documentation, extensions you need to make here and there, subtle bugs.

> But unfortunately I have not had any course in Machine Learning still. I am having a course in Artificial Intelligence this semester.

While it is certainly a great help to have some machine learning background, you do not need a PhD to start contributing to Mahout. Any infrastructure improvements that do not change the inner algorithms but make it easier to integrate Mahout and re-use it are highly welcome.

> I am also *not* conversant with hadoop and mapreduce though I have heard of it and have long wanted to learn it. Can someone please guide me (mentor informally) so that I may get a sense and direction and I am able to develop the skills set required to contribute to this project within the next 6 months.

You have taken a very good first step by contacting the mailing list. Try to figure out an area that you would like to use Mahout for and start working in that direction. If you come across any questions that cannot be answered by a trivial search in the mailing list archives, don't be shy to ask on list. When you get more proficient, answer questions other newcomers may have, start reviewing patches and maybe even contribute your own improvements.

Isabel
Re: Applying DataMining on Network Packets
On 04.10.2011 Sarath P R wrote: I am monitoring packet flow in a Network Interface . Now i want to make some predictions. What kind of prediction do you want to make? Actually i am not sure about what algorithm i should use and what kind of predictions i may need. I just want to know is it possible to classify network packets using Mahout Classification algorithm. Can anyone make some comments. The classification algorithms of Mahout are based on the idea of classifying items that have to be represented as multidimensional vectors and as a result are not bound to be used for just one domain. Put more simply: First think of what kinds of predictions you want to make. Then think of features that contain information on which prediction is more likely. Code these features as vectors and continue from there. A really nice explanation of this concept can be found in the Mahout in Action book. You can also take a quick look at the following slides for a general outline: http://www.user.tu-berlin.de/konrad.rieck/pubs.html
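The "code these features as vectors" step could look roughly like the sketch below. The packet fields (protocol, length, dst_port, syn) and the chosen features are purely hypothetical; which features carry useful information depends on the prediction you want to make. The code is plain Python and does not use Mahout's vector classes.

```python
# Hypothetical list of protocols we expect to see; one vector slot per protocol
PROTOCOLS = ["tcp", "udp", "icmp"]

def packet_to_vector(packet):
    """Turn one parsed packet (a dict) into a fixed-length list of floats."""
    # One-hot encoding of the protocol
    one_hot = [1.0 if packet["protocol"] == p else 0.0 for p in PROTOCOLS]
    return one_hot + [
        float(packet["length"]),                    # packet size in bytes
        1.0 if packet["dst_port"] < 1024 else 0.0,  # well-known destination port?
        1.0 if packet["syn"] else 0.0,              # TCP SYN flag set?
    ]

# Hypothetical packet record, e.g. produced by your capture tool
packet = {"protocol": "tcp", "length": 60, "dst_port": 443, "syn": True}
vector = packet_to_vector(packet)
print(vector)
```

Every packet ends up with the same number of dimensions in the same order, which is what a vector-based classifier needs as input.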
Re: Mahout testimonials
On 29.09.2011 Dan Brickley wrote: For what it's worth, we used Mahout in the NoTube EU project, and it saved a lot of time (and a brain transplant). I should blog this. The only piece we've used heavily in our apps (http://vimeo.com/user3487770 http://notube.tv/ ) [...] One nice thing about this community, is that Mahout is not over-marketed. If the nature or scale of your problem better suits other tools, the Mahout folk will tell you so. Thanks for the really nice comment. I've added you to our powered-by wiki page in the powered-by section - feel free to add any additional content as you see fit. https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout Isabel signature.asc Description: This is a digitally signed message part.
Re: Bayes/CBayes classification on a non-existing feature
On 29.09.2011 André-Philippe Paquet wrote: After checking in the CBayesAlgorithm class, I made my own subclass and overrode the featureWeight function to return 0 if the weight of the feature in the current label is 0 instead of returning the theta normalized weight. It fixed the problem in my case. Should I file an issue? Yes, absolutely. Your fix sounds like a nice starting point. Robin, in a second iteration, should we allow users to plug in their own strategies for weighting so far unseen features, or can we come up with one that works for the most common cases? Isabel
Fwd: ApacheCon Vancouver Meetups, and other chances for your project to get involved
On 28.09.2011 Nick Burch wrote: If you're interested in hosting a Meetup, please list the idea on the Meetups wiki[2]: http://wiki.apache.org/apachecon/ApacheMeetupsNa11 If you see one there you like the look of, bump up the interested count. Once we know there's enough takers, we'll schedule the meetup and help get the word out! Also, if you think that a company in your project area might be willing to buy some beer for your meetup, please ask them to drop Delia deliafr...@gmail.com an email and she'll help them get that sorted :) Any Mahout people (in addition to Grant Ingersoll, Shannon Quinn and myself) planning to attend Apache Con NA? In terms of other chances to get together or spread the word about your project, there are a few other options. We're still seeking speakers for the Fast Feather Track, which hosts 20 minute talks about new projects, ideas and features. If there's something new in your area, sign up and let everyone know about it! Signup is here[3]: https://docs.google.com/spreadsheet/viewform?hl=en_GBformkey=dDR5ZEN0amFzZGVGdHVnQWpuSWM0bGc6MQ#gid=0 If you are a happy Mahout user and are planning to attend Apache Con - why not put in a short presentation on your Mahout use case? I'd love to learn more on what people are working on. Isabel signature.asc Description: This is a digitally signed message part.
32 Days left to Berlin Buzzwords 2011
Hey folks,

Berlin Buzzwords 2011 is close: only 32 days are left until the big Search, Store and Scale open source crowd gathers in Berlin on June 6th/7th, 2011. The conference again focuses on the topics search, data analysis and NoSQL.

We are looking forward to two awesome keynote speakers who shaped the world of open source data analysis: Doug Cutting (founder of Apache Lucene and Hadoop) as well as Ted Dunning (Chief Application Architect at MapR Technologies and active developer at Apache Hadoop and Mahout).

We are amazed by the number and quality of the talk submissions we got. As a result, this year we have added one more track to the main conference. If you haven't done so already, make sure to book your ticket now - early bird tickets have been sold out since April 7th and there might not be many tickets left.

As we would like to give visitors of our main conference a reason to stay in town for the whole week, we have been talking to local co-working spaces and companies asking them for free space and WiFi to host Hackathons right after the main conference - that is on June 8th through 10th. If you would like to gather with fellow developers and users of your project, fix bugs together, hack on new features or give users a hands-on introduction to your tools, please submit your workshop proposal to our wiki: http://berlinbuzzwords.de/node/428

Please note that slots are assigned on a first come, first served basis. We are doing our best to get you connected, however space is limited. The deal is simple: We get you in touch with a conference room provider. Your event gets promoted in our schedule. Co-ordination however is completely up to you: Make sure to provide an interesting abstract and a Hackathon registration area - see the Barcamp page for a good example: http://berlinbuzzwords.de/wiki/barcamp

Attending Hackathons requires a Berlin Buzzwords ticket and (then free) registration at the Hackathon in question.

Hope I see you all around in Berlin,
Isabel
Re: Recommended reading
On Mon, 28 Mar 11 Dan Brickley wrote: I've collected up much of the text from this mail thread and added it to the wiki at https://cwiki.apache.org/confluence/display/MAHOUT/Reference+Reading I've added links where I could find them, wikified the voice a little (downplayed opinions and some detail), but otherwise the text is from this thread. The page currently is a bit awkward since I appended a large body of text to a pre-existing small entry, but it seemed better than adding a new page. But I'd rather circulate it as-is now than leave this on a 'someday pile', so ... there you go. Hope it's useful and that others are in the mood to jump in and polish / improve the page. Thanks so much for going to the effort of adding this information to the wiki page - sure it needs some polish, however it's nice to see some of the wisdom commonly found on the mailing list transferred over to the wiki. Isabel
BerlinBuzzwords 2011 Early Bird Ticket Period ends on April 7th.
Hey folks, just a short notice for those who haven't noticed: we have only a limited number of Early-Bird tickets left and the Early-Bird period ends on April 7th. If you want to get one of the 30 remaining tickets, go and get one now here: http://berlinbuzzwords.de/content/tickets While we are still working on the schedule and selecting speakers, we haven't sent out any rejection mails yet. So if you have submitted a talk for BerlinBuzzwords 2011 you don't need to get an Early-Bird ticket now. All potential speakers will be eligible for the Early-Bird discount even after April 7th. regards, Isabel
Re: Automatically extracted Mahout FAQs
On Wed, 23 Feb 11 Sean Owen wrote: Nice, very interesting to see and read! Very interesting indeed. Wondering whether a Top 10 of the most frequently asked questions could be created that way as well. Isabel
Re: Apache Mahout Hackathon - Berlin - Feb 2011
On Tue, 14 Dec 10 Isabel Drost wrote: early 2011 - on February 19th/20th to be more precise - the first Apache Mahout Hackathon is scheduled to take place at c-base in Berlin. Just a brief reminder - that is this weekend. We are going to start with a brief barcamp-like brainstorming session to find out what people actually want to work on during the course of the weekend. After that participants are welcome to join break-out sessions or work on their own projects. Please don't forget to bring your own ideas. Please remember to bring your own equipment. There is a bar, so no need to bring drinks. There are several restaurants near by, so don't worry about not getting anything to eat ;) http://tinyurl.com/6d9lc9z Isabel
Re: Two learning competitions that might be of interest for Mahout
On Fri, 11 Feb 11 Markus Weimer wrote: go for it! I'd do it myself but the rules we wrote prohibit me from doing so ;-) I am pretty sure these rules only forbid you entering and trying to win the competition - can't imagine that you are forbidden to run Mahout against the competition data, and maybe publish the results after the contest is over ;) Isabel
Two learning competitions that might be of interest for Mahout
http://www.kdd.org/kdd2011/kddcup.shtml KDD-Cup 2011: Recommending Music Items based on the Yahoo! Music Dataset We challenge participants to identify user tastes in music by analyzing real ratings of Yahoo! Music anonymized users. The dataset represents a snapshot of the community's preferences for various musical items. http://www.heritagehealthprize.com/competition.php The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data. Isabel
CFP - Berlin Buzzwords 2011 - Search, Store, Scale
This is to announce the Berlin Buzzwords 2011. The second edition of the successful conference on scalable and open search, data processing and data storage in Germany, taking place in Berlin. Call for Presentations Berlin Buzzwords http://berlinbuzzwords.de Berlin Buzzwords 2011 - Search, Store, Scale 6/7 June 2011 The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics: * IR / Search - Lucene, Solr, katta or comparable solutions * NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others * Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives * Closely related topics not explicitly listed above are welcome. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies. Important Dates (all dates in GMT +2) * Submission deadline: March 1st 2011, 23:59 MEZ * Notification of accepted speakers: March 22th, 2011, MEZ. * Publication of final schedule: April 5th, 2011. * Conference: June 6/7. 2011 High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters. Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no later than March 1st, 2011. Acceptance notifications will be sent out soon after the submission deadline. Please include your name, bio and email, the title of the talk, a brief abstract in English language. Please indicate whether you want to give a lightning (10min), short (20min) or long (40min) presentation and indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted for experienced users.) 
If you'd like to pitch your brand new product in your talk, please let us know as well - there will be extra space for presenting new ideas, awesome products and great new projects. The presentation format is short. We will be enforcing the schedule rigorously. If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us. Follow @hadoopberlin on Twitter for updates. Tickets, news on the conference, and the final schedule will be published at http://berlinbuzzwords.de. Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer. Please re-distribute this CfP to people who might be interested. If you are local and wish to meet us earlier, please note that this Thursday evening there will be an Apache Hadoop Get Together (videos kindly sponsored by Cloudera, venue kindly provided for free by Zanox) featuring talks on Apache Hadoop in production as well as news on current Apache Lucene developments. Contact us at: newthinking communications GmbH Schönhauser Allee 6/7 10119 Berlin, Germany Julia Gemählich Isabel Drost +49(0)30-9210 596
Re: Interested Mahout developers in the UK (or Europe?)
On Tue, 11 Jan 2011 Sean Owen sro...@gmail.com wrote: If that describes you, you can respond to me privately and I'll make sure to make the connection when I see some interesting stuff going on here. Same here for Germany (or Europe). Please also consider adding yourself to our Professional Support wiki page: https://cwiki.apache.org/confluence/display/MAHOUT/Professional+Support Isabel
Re: Adding user classes to Mahout's MR jobs.
On Tue, 11 Jan 2011 Dmitriy Lyubimov dlie...@gmail.com wrote: It's probably a little bit more of a Hadoop question though but as far as i know that's not as easy as specifying additional jars for java -cp option, is it? When using the mahout shell script it should be as easy as defining a CLASSPATH variable that contains these classes. The script should pick up this variable and extend it with all the dependencies Mahout itself needs. Similar setups are available when running on a Hadoop cluster. Isabel
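The CLASSPATH approach described above might look roughly like the following sketch. The jar path and the driver class org.example.MyDriver are hypothetical placeholders, and the sketch assumes the bin/mahout script picks up an existing CLASSPATH as described; the invocation is skipped when the script is not found in the current directory.

```shell
# Hypothetical jar holding your own classes; adjust to your build output.
USER_JAR="$HOME/myproject/target/my-classes.jar"

# Put the jar in front of any CLASSPATH that is already set.
CLASSPATH="$USER_JAR${CLASSPATH:+:$CLASSPATH}"
export CLASSPATH

# Run from the top of a Mahout checkout; skipped if the script is missing.
if [ -x ./bin/mahout ]; then
  ./bin/mahout org.example.MyDriver   # hypothetical driver class
fi
```

The `${CLASSPATH:+:$CLASSPATH}` expansion only adds the separating colon when CLASSPATH was already set, which avoids a trailing empty entry.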
Re: Seq2Sparse and Collocation
On Fri, 10 Dec 2010 Sreejith S srssreej...@gmail.com wrote: I have a text file and i converted it in to sequence file.Then i created sparse vectors using seq2sparse.Now i would like to take all the collocation generated. Pls say how to execute CollocDriver in command prompt. There is a description in our wiki: https://cwiki.apache.org/confluence/display/MAHOUT/Collocations In addition any driver in Mahout supports the --help option to print details on command line options. Isabel
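As a rough sketch of the --help hint above: the fully qualified class name below is an assumption (the package of CollocDriver has differed between Mahout releases), so check it against your own checkout before relying on it. The call is skipped when bin/mahout is not present in the current directory.

```shell
# Assumed class name; the package differs between Mahout releases.
COLLOC_DRIVER="org.apache.mahout.utils.nlp.collocations.llr.CollocDriver"

# Print the options the driver understands (run from a Mahout checkout).
if [ -x ./bin/mahout ]; then
  ./bin/mahout "$COLLOC_DRIVER" --help
fi
```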
Apache Mahout Hackathon - Berlin - Feb 2011
Hello, early 2011 - on February 19th/20th to be more precise - the first Apache Mahout Hackathon is scheduled to take place at c-base in Berlin. The Hackathon will take one weekend. There will be plenty of time to hack on your favourite Mahout issue, to get in touch with local Mahout committers, and to get your machine learning project off the ground. The venue features a bar that sells drinks (including Club Mate) so no need to bring those. Please register at https://www.xing.com/events/apache-mahout-hackathon-647603 if you are planning to attend this event so we can plan for enough space for everyone. If you have not registered for the event there is no guarantee you will be admitted. If you'd like to support the event: We'd love to provide pizza and drinks for free. If you are interested in sponsoring, please contact me at isa...@apache.org A special Thank You to c-base for providing the location free of charge. Feel free to forward this information to anyone who might be interested, tweet about the event, or mention it on your blog if you are attending. Check the above link to learn of potential changes. Looking forward to a fun and productive weekend, Isabel
DataDevRoom at the 2011 edition of the FOSDEM
Hello, We (Olivier, Nicolas and I) are organizing a Data Analytics DevRoom that will take place during the next edition of the FOSDEM in Brussels on Feb. 5. Here is the CFP: http://datadevroom.couch.it/CFP You might be interested in attending the event and taking the opportunity to speak about your projects. Important Dates (all dates in GMT +2): Submission deadline: 2010-12-17 Notification of accepted speakers: 2010-12-20 Publication of final schedule: 2011-01-10 Meetup: 2011-02-05 The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics: Information retrieval / Search, Large Scale data processing, Machine Learning, Text Mining, Computer vision, Linked Open Data. High quality, technical submissions are called for, ranging from principles to practice. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies. Submissions should be based on free software solutions. Looking forward to meeting you face to face in Brussels, Isabel
Re: Checkouts and branches
On 05.12.2010 Lance Norskog wrote: Where is the branch/tag for 0.4? In Mahout's repository at /tags/mahout-0.4 - see also http://svn.apache.org/viewvc/mahout/tags/mahout-0.4/ Isabel
Re: Bayes Question.
On Thu, 25 Nov 2010 JAGANADH G jagana...@gmail.com wrote: Or is it enough to train with either of good or bad.? It will be something like train a person to identify 'sweet' by giving 'salt' as sample There are some domains where it may make sense to formulate a task as a one-class classification problem. E.g. looking at time series data one might want to train a model to identify normal behaviour from positive data only. Though it is possible to come up with algorithms for this so-called one-class classification problem*, I am not aware of any implementation in Mahout. Isabel * For instance see One-Class SVMs for Document Classification by Larry M. Manevitz and Malik Yousef for some references and comparison.
Re: classification algorithm
On Thu, 18 Nov 2010 Radu Spineanu r...@timisoara.roedu.net wrote: I'm a Debian Developer and I noticed Mahout is not in Debian. If I'm able to wrap my head around everything and get it working I would love to contribute back and package it. That would be awesome. Mahout does have quite a few dependencies which might make it an interesting packaging exercise. I am not sure whether all of them are available in Debian already. At least Hadoop should be available in Debian testing, but did not yet make it to the latest stable release. Isabel
Re: Mahout in talk
On 12.11.2010 JAGANADH G wrote: I will be giving a talk on Machine Learning at BarCamp Kerala 9. I have included Mahout in the talk too. I will give a demo of recommendation and classification with Mahout. It would be great if you could put your slides online in our wiki (if you use any): https://cwiki.apache.org/MAHOUT/books-tutorials-and-talks.html Isabel
Re: How to Cluster?
On Fri, 22 Oct 2010 SIAVASH GHODSI MOGHADDAM gmsiava...@live.utm.my wrote: What I am looking for now, is a Clustering Code Sample. Did you have a look at the examples module of Mahout? There is also quite some documentation in the Mahout wiki to get you started. Isabel
Re: Mahout dependencies on windows
On Mon, 25 Oct 2010 22:54:42 +0100 Steven Bourke sbou...@gmail.com wrote: Ted - Has mahout got an image up on EC2 that anyone can use or do we have to build from scratch? None that I'm aware of, however building from scratch should be fairly easy: https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2 https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce Isabel
FYI: Fw: Fast Feather Track at ApacheCon - submit your talk now!
I thought the following might be a nice option to present your awesome Mahout use case to a broader audience - or maybe tell others what you did for GSoC. Begin forwarded message: Date: Mon, 4 Oct 2010 22:54:09 +0100 (BST) From: Nick Burch nick.bu...@alfresco.com To: gene...@incubator.apache.org Subject: Fast Feather Track at ApacheCon - submit your talk now! Hi All We've under a month to go now to Atlanta, and hopefully you've all registered and are all looking forward to a great week? Other than our packed schedule of talks, our growing list of meetups (see Shane's email from Friday for more details on hosting your own), what could be more exciting than your Next Big Thing? Well, that's where the Fast Feather Track comes in! The Fast Feather Track provides space for the projects that are just too new or fast-moving to fit in to the normal CFP. It's especially a great slot for new incubator projects to talk about what they're up to, or share their passion for some new technology out there. So, this is your chance of twenty minutes of fame for your incubating project :) The Fast Feather Track is all about the technology - so whether you're a novice or a natural at public speaking, there's room for you! Anything new at The Apache Software Foundation belongs here, along with new external technologies that can help us work better. Whether you're ready for showtime and want the world to know, or you're still finding your feet, and just fishing for a few new contributors / mentors, this is the slot for you! What we're after now is people to tell us what they want to talk about. We've got a room for a day, and a big empty schedule board with 20 minute slots on it, so now all we need is some talks to fill it with... We're aiming to fix most of the schedule now, but we'll probably keep a few slots spare for some last minute talks. 
But if you've already got your ticket, and you know what you want to talk about, please let us know now, so we can make sure there's space for everyone. To submit your talk, please head over to google docs and tell us about yourself and your talk: https://spreadsheets.google.com/viewform?formkey=dElobGxibG1oc05OeFNqRFZ1S0tpLVE6MQ We'll see you, your projects, and your great short talks in Atlanta! Nick (NB Speaking in the Fast Feather Track does not entitle you to the full range of speaker perks - you'll get a shiny badge, and your bio in the program, but you won't get your travel, room or registration comped. Doesn't make it any less fun though, we promise!)
Re: Mahout usage
On Thu, 30 Sep 2010 Grant Ingersoll gsing...@apache.org wrote: Now, if we could just get people to add to the Powered By page! Anyone ever successfully convinced a Mahout (or Lucene etc.) user to put their name on the Powered By? I'd be interested in learning more on the arguments that worked for others... Isabel
Re: Mahout usage
On Fri, 1 Oct 2010 Grant Ingersoll gsing...@apache.org wrote: I'm working on a few... I know they are out there, as they email in private. Same here: One huge fear that people seem to have is to reveal the inner workings of their system not only to the public but also to potential competitors by putting their name on our list. Isabel
Re: Text Classification using Mahout
On Thu, 30 Sep 2010 Sean Owen sro...@gmail.com wrote: Ignore it, it's just Maven doing its thing in the background. It should work fine without internet connectivity. To speed up the build when you do not have internet connectivity, you can pass -o on the command line to tell Maven to work offline. That way it does not try to check for updates. Isabel
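A minimal sketch of such an offline build, assuming it is run from the top of the Mahout checkout; the `install` goal is just an example and any other goal works the same way. The build is skipped when Maven or a pom.xml is missing.

```shell
# -o is Maven's offline switch (long form: --offline).
MVN_OFFLINE="-o"

# Build only when Maven and a pom.xml are present in this directory.
if command -v mvn >/dev/null 2>&1 && [ -f pom.xml ]; then
  mvn "$MVN_OFFLINE" install
fi
```

Offline mode only works once all dependencies are already in the local repository, so at least one earlier build with connectivity is needed.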
SGD example
Hi, I just tried running the SGD example with the following command line (adapted from the corresponding JIRA issue):

./bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 100 --rate 50 --lambda 0.001 --input examples/src/main/resources/donut.csv --features 21 --output donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n

When running the code above I ran into a few NullPointerExceptions - I was able to fix them with a few tiny changes. If not stripped they should be attached to this mail to highlight the lines of code that caused the trouble. However I was wondering whether I simply used the wrong command line. Isabel

diff --git a/core/src/main/java/org/apache/mahout/classifier/sgd/CsvRecordFactory.java b/core/src/main/java/org/apache/mahout/classifier/sgd/CsvRecordFactory.java
index 5cbdef2..bde3021 100644
--- a/core/src/main/java/org/apache/mahout/classifier/sgd/CsvRecordFactory.java
+++ b/core/src/main/java/org/apache/mahout/classifier/sgd/CsvRecordFactory.java
@@ -243,8 +243,9 @@ public class CsvRecordFactory implements RecordFactory {
       if (predictor >= 0) {
         value = values.get(predictor);
       } else {
-        value = null;
+        value = null;
       }
+      System.out.println(value);
       predictorEncoders.get(predictor).addToVector(value, featureVector);
     }
     return targetValue;
diff --git a/core/src/main/java/org/apache/mahout/vectors/ConstantValueEncoder.java b/core/src/main/java/org/apache/mahout/vectors/ConstantValueEncoder.java
index d76fd81..3112681 100644
--- a/core/src/main/java/org/apache/mahout/vectors/ConstantValueEncoder.java
+++ b/core/src/main/java/org/apache/mahout/vectors/ConstantValueEncoder.java
@@ -34,7 +34,7 @@ public class ConstantValueEncoder extends FeatureVectorEncoder {
     for (int i = 0; i < probes; i++) {
       int n = hashForProbe(originalForm, data.size(), name, i);
       if(isTraceEnabled()){
-        trace((byte[]) null, n);
+        trace(new byte[]{}, n);
       }
       data.set(n, data.get(n) + getWeight(originalForm,weight));
     }
diff --git a/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java b/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java
index 30cd353..3f7d1d5 100644
--- a/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java
+++ b/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java
@@ -132,6 +132,8 @@ public final class TrainLogistic {
   private static double predictorWeight(OnlineLogisticRegression lr, int row, RecordFactory csv, String predictor) {
     double weight = 0;
+    if (csv.getTraceDictionary().get(predictor) == null)
+      return 0;
     for (Integer column : csv.getTraceDictionary().get(predictor)) {
       weight += lr.getBeta().get(row, column);
     }
Apache Hadoop Get Together Berlin October 2010 - this time with a huge Mahout focus
Hello, this is to announce the next Apache Hadoop Get Together sponsored by JTeam (http://www.jteam.nl) that will take place at newthinking store in Berlin. When: October 7th, 5p.m. Where: Newthinking store Berlin As always there will be slots of 30min each for talks on your Hadoop topic. After each talk there will be a lot of time to discuss. You can order drinks directly at the bar in the newthinking store. If you like, you can order pizza. We will go to Cafe Aufsturz after the event for some beer and something to eat. Talks scheduled so far: Max Heimel: Hidden Markov Models for Apache Mahout Abstract: In this talk I will present and discuss an implementation of a powerful statistical tool called Hidden Markov Models for the Apache Mahout project. Hidden Markov Models make it possible to mathematically deduce the structure of an underlying - and unobservable - process based on the structure of the produced data. Hidden Markov Models are thus frequently applied in pattern recognition to deduce structures that are not directly observable. Examples of applications of Hidden Markov Models include the recognition of syllables in speech recordings, handwritten letter recognition and part-of-speech tagging. Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout Abstract: Recommendation Mining helps users find items they like. A very popular way to implement this is by using Collaborative Filtering. This talk will give an introduction to an approach called Itembased Collaborative Filtering and explain Mahout's Map/Reduce based implementation of it. Please do indicate on Upcoming http://upcoming.yahoo.com/event/6792156 or on Xing https://www.xing.com/events/apache-hadoop-berlin-october-2010-564265 if you are coming so we can more safely plan capacities. Updates to the event, a brief summary and videos will be posted on http://isabel-drost.de/hadoop JTeam is looking for Java developers and search enthusiasts. 
Check out their jobs page (http://www.jteam.nl/Jobs/Jobs.html) for more info! As always a big Thank You goes to newthinking store for providing the venue for free for our event. Looking forward to seeing you in Berlin as well, Isabel
Re: how is the Vector format?
On Sun, 5 Sep 2010 Valerio Ceraudo valerio.cera...@gmail.com wrote: ok ok I can run your arffToVector in org.apache.utils.vectors.arff.Driver but i found a bug, it doesn't recognize the attribute REAL, so I changed the arff attributes in NUMERIC and it works,now I have got a iris.arff.MVC file. Any chance you might have some time to file a JIRA issue for that - or maybe even provide a patch that fixes the issue? Isabel
Re: Version compatibility of Mahout 0.4-SNAPSHOT with Hadoop release?
On Thu Peter M. Goldstein peter_m_goldst...@yahoo.com wrote: Yes, my original email should have said 0.20.2+320. Sorry about the typo. You can find that version here: http://archive.cloudera.com/cdh/3/ Or at Debian Squeeze (http://packages.qa.debian.org/h/hadoop.html) or of course directly from the Apache Hadoop project. And it does explicitly say 0.20.2 on the Mahout on Amazon EC2 wiki page. Just for further reference - system requirements for Mahout are tracked on the wiki page named accordingly: https://cwiki.apache.org/confluence/display/MAHOUT/System+Requirements Isabel
Re: ICML / COLT and Mahout
On Wed Danny Leshem dles...@gmail.com wrote: I took a different track, so only had a chance to chat with some of the open-source participants during their poster session. Most of them never heard of Mahout, or only heard of it by name. Would you be interested in introducing Mahout to the ICML/COLT people in a future workshop or in JMLR MLOSS? I am sure the Mahout community would be more than happy to help you proof-read your publication. Isabel
Re: Installing Mahout
On Thu tammuz rasil...@gmail.com wrote: Well this is what I note during the installation: Running org.apache.mahout.clustering.TestPrintableInterface Tests run: 22, Failures: 19, Errors: 0, Skipped: 0, Time elapsed: 0.427 sec FAILURE! In case of failing tests you should be able to see more information when looking into $module-name/target/surefire-reports/org.apache.mahout.clustering.TestPrintableInterface.txt The content of that file should help diagnose the problem for us as well. Isabel
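One way to locate and print that report without knowing the module name, as a small sketch: the helper name show_surefire_report is made up for illustration, and the lookup only runs when a pom.xml is present in the current directory.

```shell
# Print every surefire text report for the given test class below a directory.
show_surefire_report() {
  test_class="$1"
  root="${2:-.}"
  find "$root" -path "*/target/surefire-reports/${test_class}.txt" \
    -exec cat {} \;
}

# Run from the top of the Mahout checkout.
if [ -f pom.xml ]; then
  show_surefire_report org.apache.mahout.clustering.TestPrintableInterface
fi
```

The output of this command is what is worth pasting into a follow-up mail to the list.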
Re: Getting started with mahout
On Tue Jeff Eastman j...@windwardsolutions.com wrote: that you can browse for historical purposes. As a way of getting started, I'd suggest learning to run some of the examples. If one of our algorithms seems most interesting, jump into its unit tests and begin to explore the code. Some more information on how to get started contributing to Mahout: https://cwiki.apache.org/MAHOUT/howtocontribute.html Isabel
Re: Gephi graph visualization
On Thu Grant Ingersoll gsing...@apache.org wrote: Stefan G. gave a nice demo of this at Buzzwords (http://gephi.org/) and I tried it out on the plane ride home and it seems like it could be used as a nice way to visualize clusters. It can import a CSV file that is essentially a big matrix of nodes and edges. I think it wouldn't be too hard to have a job that converts the clusters into this CSV format for easy loading. +1 I used Gephi for graph visualisation earlier this year - it seems capable of handling reasonably sized graphs and makes understanding their structure really easy. Being an interactive tool, it is also helpful for exploring your linked data. It would be great to be able to, say, import the results of our clustering jobs into Gephi. Isabel
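To make the conversion idea a bit more concrete, here is a rough sketch rather than an existing Mahout job. It assumes a hypothetical input of one tab-separated "clusterId pointId" pair per line (not an actual Mahout output format) and writes an edge list with Source and Target columns, which Gephi's CSV import is assumed to accept. The function name clusters_to_edges is made up for illustration.

```shell
# Turn "clusterId<TAB>pointId" lines into a CSV edge list that links
# every point to a node representing its cluster.
clusters_to_edges() {
  printf 'Source,Target\n'
  awk -F '\t' 'NF == 2 { print "cluster-" $1 "," $2 }' "$@"
}

# Example with three points in two clusters (hypothetical ids).
printf '0\tdocA\n0\tdocB\n1\tdocC\n' | clusters_to_edges
```

Linking points to a cluster node keeps the file linear in the number of points; a full node-by-node matrix would grow quadratically.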