Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair

2018-07-19 Thread Isabel Drost-Fromm




On 19/07/18 10:24, Sebastian Schelter wrote:

Congrats!


+1

Looking forward to hearing Andrew's voice on one of the upcoming board 
calls - please do feel invited to join as a new PMC chair.



Isabel


Re: New logo

2017-05-06 Thread Isabel Drost-Fromm
The green logo was the very first design iteration before, IIRC, Robin came up
with the yellow one. There should be about five T-shirts worldwide with the old
logo, printed in 2009.


On 1 May 2017 20:41:43 CEST, Trevor Grant wrote:
>Thanks Scott,
>
>You are correct- in fact we're going even further now, that you can do
>native optimization regardless of the architecture with native-solvers.
>
>Do you or anyone more familiar with the history of the website know
>anything about the origins/uses of this:
>https://mahout.apache.org/images/Mahout-logo-245x300.png
>It seems to be a green mahout logo.
>
>Also Scott, or anyone lurking who may be able to help.  As part of the
>website reboot I've included a "history" page and would really appreciate
>some help capturing that from first person sources if possible. I've put in
>some headers but those are only directional:
>
>https://github.com/rawkintrevo/mahout/blob/website/website/front/community/history.md
>
>
>
>Trevor Grant
>Data Scientist
>https://github.com/rawkintrevo
>http://stackexchange.com/users/3002022/rawkintrevo
>http://trevorgrant.org
>
>*"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
>On Mon, May 1, 2017 at 11:18 AM, scott cote wrote:
>
>> Trevor et al:
>>
>> Some ideas to spur you on (and related points):
>>
>> Mahout is no longer a grab bag of algorithms and routines, but a math
>> language right?  You don’t care about the under the cover implementation.
>> Today its Spark with alternative implementations in Flink, etc ….
>>
>> Don’t know if that is the long term goal still  - haven’t kept up - but it
>> seems like you are insulating yourself from the underlying technology.
>>
>> Math is a universal language.  Right?
>>
>> Tower of Babel is coming to mind ….
>>
>> SCott
>>
>> > On Apr 27, 2017, at 10:27 PM, Trevor Grant wrote:
>> >
>> > It also bugs me when I can't suggest any alternatives, yet don't like the
>> > ones in front of me...
>> >
>> > I became aware of a symbol a week or so ago, and it keeps coming back to
>> > me.
>> >
>> > The Enso.
>> > https://en.wikipedia.org/wiki/Ens%C5%8D
>> >
>> > Things I like about it:
>> > (all from wikipedia, since the only thing I knew about this symbol prior
>> > is that someone I met had a tattoo of it).
>> > It represents (among a few other things) enlightenment.
>> > ^^ This resonated with the 'alternate definition of mahout' from Hebrew-
>> > which may be something akin to essence or truth.
>> >
>> > It is a circle- which plays to the Samsara theme.
>> >
>> > It is very expressive, a simple one or two brush stroke circle which
>> > symbolizes several large concepts and things about the creator,
>> > expressive like our DSL (I feel gross comparing such a symbol to a Scala
>> > DSL, but I'm spit balling here, please forgive me- I am not so
>> > expressive).
>> >
>> > "Once the *ensō* is drawn, one does not change it. It evidences the
>> > character of its creator and the context of its creation in a brief,
>> > contiguous period of time." Which reminds me of the DRMs
>> >
>> > In closed form it represents something akin to Plato's perfection- which
>> > a little more wiki surfing tells me is the idea that no one can create a
>> > perfect circle because a circle is a collection of infinite points and
>> > how could you ever be sure that you have arranged each one properly, yet
>> > such things must exist, or what blueprint would a creator of circles be
>> > striving for.  This, by-the-by reminds me of stochastic approaches to
>> > solving problems, and really statistics / "machine-learning" in general,
>> > in that we can't find perfect solutions, yet we believe solutions exist
>> > and serve as our blueprint.
>> >
>> > Finally, I like that it is simple.
>> >
>> > Things I don't like about it:
>> > Lucent Technologies used it back in the 90s, however they used a very
>> > specific red one, and this isn't a deal breaker for me.
>> >
>> > Other thoughts:
>> > Based on the tattoo I saw- one could make an Enso using old mahout color
>> > palette if one were to dab their brush in the appropriate colors. This
>> > could also be represented in any single color. (Not sure what that does
>> > to our TM, is it ok if we just keep slapping TMs on the side of it? If
>> > that is the case is there any reason we must have a single Enso?)
>> >
>> > So there is something to throw in the pot that is a little more grown up
>> > than my runner up favorites (honey badger, blueman riding bomb waving
>> > cowboy hat, blueman riding lightning bolt into a squirrel covered in
>> > water, etc).
>> >
>> > Again, only know what wiki has told me, so if anyone is more familiar
>> > with this symbol (like was it used as a logo by some horrible dictator
>> > which carried out terrible atrocities?) or just general comments.
>> > tg
>> >
>> >
>> >
>> > Trevor Grant
>> > Data 

RE: Welcome our GSoC Student Aditya Sarma

2017-05-05 Thread Isabel Drost-Fromm
Hi Aditya,

Welcome. Great to have you here. 

Isabel

-- 
Sent from my Android phone with K-9 Mail.

Re: Marketing

2017-03-29 Thread Isabel Drost-Fromm
One more thing: what was really helpful in spreading the word in the early days 
was collecting real user stories: who achieved what with Mahout. Could be 
helpful for the new multi backend version as well. Imagine quotes like "we've 
successfully used Mahout on $insertBackendHere to solve 
$insertSuperDuperCoolUsecaseHere in no time" says $name, CTO of $hotNewStartup 
in an article about the project.

Warning: this is tedious work, involves monitoring Twitter, having a Google 
alert for the name and talking to any number of people over long periods of 
time to nudge them to go public with their potentially confidential story.


On 30 March 2017 01:03:31 CEST, Isabel Drost-Fromm <isa...@apache.org> wrote:
>That is an awesome second interpretation.
>
>Having voted on the original name I'm 100% biased so take my opinion
>with a huge grain of salt: on the one hand I think name changes are
>overrated (anyone remember Ethereal?), on the other hand IMHO Mahout
>is a fairly strong brand representing machine learning at scale.
>
>Maybe a combination of any of a new logo, design, documentation,
>release that drops the zero in "0.x.y", a press release for that
>release that Sally can help you with, a new front page that publishes
>the new focus of development, maybe a few snippets on that shift in
>focus that editors can use, dropping deprecated code would already go a
>long way... Just some random ideas.
>
>Isabel
>
>
>On 25 March 2017 03:21:50 CET, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>>> maybe we should drop the name Mahout altogether.
>>
>>
>>I have been told that there is a cool secondary interpretation of Mahout as
>>well.
>>
>>I think that the Hebrew word is pronounced roughly like Mahout.
>>
>>מַהוּת
>>
>>The cool thing is that this word means "essence" or possibly "truth". So
>>regardless of the guy riding the elephant, Mahout still has something to be
>>said for it.
>>
>>(I have no Hebrew, btw)
>>(real speakers may want to comment here)
>


Re: Marketing

2017-03-29 Thread Isabel Drost-Fromm
That is an awesome second interpretation.

Having voted on the original name I'm 100% biased so take my opinion with a
huge grain of salt: on the one hand I think name changes are overrated (anyone
remember Ethereal?), on the other hand IMHO Mahout is a fairly strong brand
representing machine learning at scale.

Maybe a combination of any of a new logo, design, documentation, release that 
drops the zero in "0.x.y", a press release for that release that Sally can help 
you with, a new front page that publishes the new focus of development, maybe a 
few snippets on that shift in focus that editors can use, dropping deprecated 
code would already go a long way... Just some random ideas.

Isabel


On 25 March 2017 03:21:50 CET, Ted Dunning wrote:
>On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel wrote:
>
>> maybe we should drop the name Mahout altogether.
>
>
>I have been told that there is a cool secondary interpretation of Mahout as
>well.
>
>I think that the Hebrew word is pronounced roughly like Mahout.
>
>מַהוּת
>
>The cool thing is that this word means "essence" or possibly "truth". So
>regardless of the guy riding the elephant, Mahout still has something to be
>said for it.
>
>(I have no Hebrew, btw)
>(real speakers may want to comment here)


Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-02-01 Thread Isabel Drost
On Tue, Jan 31, 2017 at 04:06:36PM -0800, Dmitriy Lyubimov wrote:
> Except for a several applied
> off-the-shelves, Mahout has not (hopefully just yet) developed a
> comprehensive set of things to use.

Do you think there would be value in having that? Funding aside, would now be a
good time to develop that or do you think Samsara needs more work before
starting to work on that?

If there's value and good timing: Do you think it would be possible to mentor
downstream users to help get this done? And a question to those still reading
this list: Would you be interested and able (time-wise) to help out here?


> The off-the-shelves currently are cross-occurrence recommendations (which
> still require real time serving component taken from elsewhere), svd-pca,
> some algebra, and Naive/complement Bayes at scale.
> 
> Most of the bigger companies i worked for never deal with completely the
> off-the-shelf open source solutions. It always requires more understanding
> of their problem. (E.g., much as COO recommender is wonderful, i don't
> think Netflix would entertain taking Mahout's COO run on it verbatim).

Makes total sense to me. Would it be possible to build a base system that
performs OK and can be extended such that it performs fantastically with a bit
of extra secret sauce?


> It is quite common that companies invest in their own specific
> understanding of their problem and requirements and a specific solution to
> their problem through iterative experimentation with different
> methodologies, most of which are either new-ish enough or proprietary
> enough that public solution does not exist.

While that does make a lot of sense, what I'm asking myself over and over is
this: Back when I was more active on this list there was a pattern in the
questions being asked. Often people were looking for recommenders, fraud
detection, event detection. Is there still such a pattern? If so, it would be
interesting to think about which of those problems are widespread enough that
offering a standard package integrated from data ingestion to prediction would
make sense.


> That latter case was pretty much motivation for Samsara. If you are a
> practitioner solving numerical problems thru experimentation cycle, Mahout
> is much more useful than any of the off-the-shelf collections.

+1 This is also why I think focussing on Samsara and focussing on making that
stable and scalable makes a lot of sense.

The reason why I dug out this old thread comes from a slightly different angle:
We seem to have a solid base. But it's only really useful for a limited set of
experts. It will be hard to draw new contributors and committers from that set
of users (it will IMHO even be hard to find many users who are that skilled).
What I'm asking myself is if we should and can do something to make Mahout
useful for those who don't have that background.



> > perspective? If so, would there be interest among the Mahout committers to
> > help
> > users publicly create docs/examples/modules to support these use cases?
> >
> 
> yes

Where do we start? ;)


Isabel




Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Isabel Drost-Fromm

Hi,

On Fri, Sep 16, 2016 at 11:36:03PM -0700, Andrew Musselman wrote:
> and we're thinking about just how many pre-built algorithms we
> should include in the library versus working on performance behind the
> scenes.

To pick this question up: I've been watching Mahout from a distance for quite
some time. From what limited background I have of Samsara, I really like its
approach of being able to run on more than one execution engine.

To give some advice to downstream users in the field - what would be your advice
for people tasked with concrete use cases (stuff like fraud detection, anomaly
detection, learning search ranking functions, building a recommender system)? Is
that something that can still be done with Mahout? What would it take to get
from raw data to finished system? Is there something we can do to help users get
that accomplished? Is there even interest from users in such a use-case-based
perspective? If so, would there be interest among the Mahout committers to help
users publicly create docs/examples/modules to support these use cases?


Isabel



Berlin Buzzwords 2014: CfP is open

2014-01-23 Thread Isabel Drost-Fromm
I'm super happy to announce that the call for submissions for Berlin
Buzzwords 2014 is open. For those who don't know the conference - in
my absolutely objective opinion the event is the most exciting
conference on storing, processing and searching large amounts of
digital data for engineers.

The 5th edition of Berlin Buzzwords will take place on May 25-28,
2014 at Kulturbrauerei Berlin.

Berlin Buzzwords is looking for speakers who submit talks on the
following topics:

* Information Retrieval / Search, e.g. Lucene, Solr, Katta, ElasticSearch or
comparable solutions

* NoSQL and SQL, e.g. CouchDB, MongoDB, Jackrabbit, HBase and others

* Large Data Processing, e.g. Hadoop itself, MapReduce, Cascading, Pig,
Spark and friends

Closely related topics not explicitly listed above are welcome as well.

The Call for Submissions will be open until February 9! Be part of
Berlin Buzzwords and submit your session idea. Please register here:
http://berlinbuzzwords.de/call-submissions.

Looking forward to lots of interesting proposals - and looking forward to
meeting all of you in Berlin later this year (did I mention that Berlin
rocks in summer?)


Isabel

PS: As always, any help with spreading the word is highly welcome.

PS2: One final hint - even though speakers of course get a complimentary
conference pass, make sure to still check out our ticket page, in
particular if you'd like to bring your children to the conference - we
do provide child day care on a donation basis but need your registration
for capacity planning: http://berlinbuzzwords.de/tickets



Re: java.lang.NoClassDefFoundError: com/google/common/base/Preconditions

2013-11-29 Thread Isabel Drost-Fromm
On Thu, 28 Nov 2013 13:24:26 +0530
Tharindu Rusira tharindurus...@gmail.com wrote:

 Yes that's the exact issue Suneel, it was a careless mistake while
 adding projects to Eclipse that I missed those .jars.

When changing Mahout code, make sure to either run

  mvn eclipse:eclipse

before importing the project into your workspace, or enable Maven support in
Eclipse.

When integrating Mahout into your project it's best to use Maven, Ivy,
Gradle or some other build system that supports resolving transitive
dependencies automatically to avoid these issues.
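
To illustrate what "resolving transitive dependencies" means here, a minimal
sketch in Python (this is not how Maven, Ivy or Gradle are implemented): given
each artifact's direct dependencies, walk the graph and collect everything that
has to end up on the classpath. The graph below is a hypothetical, heavily
simplified stand-in and the artifact names are for illustration only. Leaving
out one of the indirectly required jars by hand is the kind of mistake that
shows up as a NoClassDefFoundError at runtime.

```python
from collections import deque


def transitive_dependencies(direct_deps, root):
    """Return every artifact reachable from root, excluding root itself."""
    seen = set()
    queue = deque(direct_deps.get(root, []))
    while queue:
        artifact = queue.popleft()
        if artifact in seen:
            continue  # already collected via another path
        seen.add(artifact)
        # Follow this artifact's own direct dependencies as well.
        queue.extend(direct_deps.get(artifact, []))
    return sorted(seen)


# Hypothetical, simplified dependency graph (assumption, for illustration only).
graph = {
    "my-app": ["mahout-core"],
    "mahout-core": ["mahout-math", "guava"],
    "mahout-math": ["guava"],
}

# "my-app" only declares mahout-core, yet needs all three jars at runtime.
print(transitive_dependencies(graph, "my-app"))
```

A build tool does this walk (plus version conflict resolution) for you, which
is why declaring the single top-level dependency is usually enough.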


Isabel


Re: Mahout fpg

2013-11-29 Thread Isabel Drost-Fromm
On Fri, 22 Nov 2013 17:55:13 +0800
Jason Lee wua...@gmail.com wrote:

 I noticed lots of algorithms implementations has deprecated in Mahout
 0.8 and removed in 0.9,  but no reasons or comments been marked. Can
 i ask why?

As Suneel mentioned earlier: Before removing these algorithms we asked
on the user list for input on what users really needed.

If you need anything that was marked deprecated you are welcome to step
up and provide patches and improvements to revive implementations that
are currently in danger of being deleted soon.


 Btw, Mahout API is a little lack javadoc comments, every contributors
 of Mahout should has the responsibility to add more javadoc comments
 to the java file they created.

Not an excuse but maybe a step forward: If you find classes and
packages lacking documentation that you know well (or are in the
process of getting to know well) we'd be grateful if you could provide
the missing documentation as a patch to the code base*. 


Isabel

* Also in my experience documentation patches tend to be easier to get
  approval for from your employer than donating whole new
  implementations that you have developed internally...


Re: Could OpenNLP use Mahout for classification?

2013-04-10 Thread Isabel Drost-Fromm

Hi Jörn,


On Tuesday, April 09, 2013 10:12:47 PM Jörn Kottmann wrote:
 Logistic Regression (is that similar to our maxent ?)
 Online Passive Aggressive
 HMM

 The datasets we are training OpenNLP are usually rather small and can
 easily be processed with a single CPU, does Mahout support training on
 small scale datasets as well?

In particular the Logistic Regression and HMM stuff should be well suited
even for smaller data sets. You can find the JavaDoc for each here:

https://builds.apache.org//job/Mahout-Quality/javadoc/org/apache/mahout/classifier/sgd/package-summary.html


and here:

https://builds.apache.org//job/Mahout-Quality/javadoc/org/apache/mahout/classifier/sequencelearning/hmm/package-summary.html

Both have versions that can run standalone on a single box - they may come 
with Hadoop as a dependency, mainly for serializing vectors and matrices to 
disk but not for computation distribution.
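
To give an idea of why a small data set is no problem for this kind of
learner, here is a minimal sketch in plain Python of stochastic gradient
descent for binary logistic regression on a single machine. It is not Mahout's
API or implementation; the function names, learning rate, epoch count and toy
data are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(examples, n_features, epochs=200, rate=0.1, seed=42):
    """Train binary logistic regression with one update per example.

    examples: list of (feature_list, label) pairs with label in {0, 1}.
    Returns the weights; the bias term is stored at index 0.
    """
    rng = random.Random(seed)
    weights = [0.0] * (n_features + 1)
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order every epoch
        for features, label in data:
            x = [1.0] + list(features)  # prepend constant input for the bias
            error = label - sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for i, xi in enumerate(x):
                weights[i] += rate * error * xi
    return weights


def predict(weights, features):
    """Return the estimated probability that the label is 1."""
    x = [1.0] + list(features)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


# Toy data (assumption): one feature, label 1 for the larger values.
toy = [([0.0], 0), ([1.0], 0), ([2.0], 0), ([4.0], 1), ([5.0], 1), ([6.0], 1)]
model = train_logistic_sgd(toy, n_features=1)
print(predict(model, [0.0]), predict(model, [6.0]))
```

Because each update touches only one example, the whole model fits in memory
and nothing needs to be distributed for data sets of this size.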


Isabel


[OT] Any Mahout ppl interested in meeting for a drink or two in Sydney on Friday?

2012-09-22 Thread Isabel Drost
Hi,

on Friday, Sep 28th I'll be meeting with a few other Apache ppl in Sydney for 
dinner and a drink or two. Let me know if you are interested in joining us.


Isabel


Re: NetflixRecommender Data

2012-06-02 Thread Isabel Drost
On 01.06.2012 Sean Owen wrote:
 It is no longer officially available because of the lawsuit against
 Netflix.

Hmm - when thinking of our tidying up work: Should we then remove the example?


Isabel


signature.asc
Description: This is a digitally signed message part.


Re: Mahout + BigDataR Linux

2012-06-01 Thread Isabel Drost
On Fri, Jun 1, 2012 at 7:40 AM, Nicholas Kolegraff
nickkolegr...@gmail.com wrote:
 I'm on board with this.
 This has been a common suggestion from more advanced users (and makes
 sense). I am exploring how to incorporate packages into the build process, I
 don't want to commit to anything, yet, but plan to take a much deeper dive
 mid July.

Some information that might help you:

The Debian new maintainers guide:
http://www.debian.org/doc/manuals/maint-guide/index.en.html

The Debian wiki on how to package Java projects including information
on how to package maven-built software:
http://wiki.debian.org/Java/Packaging

There also is a mailing list for more discussion on problems and
questions related to packaging java projects into Debian:
http://lists.debian.org/debian-java/

One word of warning: You might run into one issue or another, as Java
projects usually aren't built in a way that's particularly amenable to
turning them into distribution packages right away. However, it should
help that Mahout is Maven-built and relies on standard libraries only.


Cheers,
Isabel


Re: Mahout + BigDataR Linux

2012-05-31 Thread Isabel Drost
On 03.05.2012 Ted Dunning wrote:
 As a point of strategy, wouldn't have better to just build a debian package
 repository and a script for installing packages?

Or go even one step further and provide real Debian packages?


Isabel




Berlin Buzzwords program is online

2012-04-26 Thread Isabel Drost
This is to announce the Berlin Buzzwords program. The Program Committee has
completed reviewing all submissions and set up the schedule containing a great
lineup of speakers for this year's Berlin Buzzwords program. Among the speakers
we have Leslie Hawthorn (Red Hat), Alex Lloyd (Google), Michael Busch (Twitter)
as well as Nicolas Spiegelberg (Facebook). Check out our program at
http://berlinbuzzwords.de/program/session-schedule

Berlin Buzzwords standard conference tickets are still available. Note that we
also offer a special rate for groups of 5 and more attendees with a 15%
discount off the standard ticket price.

“Berlin Buzzwords is by far one of the best conferences around if you care
about search, distributed systems, and NoSQL...” says Shay Banon, founder of
ElasticSearch.

Berlin Buzzwords will take place June 4th and 5th 2012 at Urania Berlin
(http://www.uraniaberlin.de). The 3rd edition of the conference for developers
and users of open source projects again focuses on everything related to
scalable search, data analysis in the cloud and NoSQL databases. We are
bringing together developers, scientists, and analysts working on innovative
technologies for storing, analysing and searching today's massive amounts of
digital data.

Berlin Buzzwords is organised by newthinking communications GmbH in 
collaboration with Isabel Drost (Member of the Apache Software Foundation, PMC 
member Apache community development and co-founder of Apache Mahout), Jan 
Lehnardt (PMC member Apache CouchDB) and Simon Willnauer (PMC member Apache 
Lucene). 

More information including speaker interviews, ticket sales, press information
as well as "meet me at bbuzz" buttons is available on the official website:
http://berlinbuzzwords.de/

Looking forward to meeting you in June,

Isabel

PS: Did I mention that Berlin is all beautiful in Summer?




Re: Slides for Talk on BedCon

2012-04-26 Thread Isabel Drost
On 31.03.2012 Manuel Blechschmidt wrote:
 you can find my slides for my presentation based on Mahout and Java EE for
 the Berlin Expert Days 2012 here:
 
 https://github.com/ManuelB/facebook-recommender-demo/raw/master/docs/Talk-BedCon-Berlin-2012.pdf

Thanks for sharing - I added them here:

https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks


Isabel




Re: Mahout at Twitter

2012-04-26 Thread Isabel Drost
On 05.04.2012 Jake Mannix wrote:
 changing subject line to split off from Sean's Myrrix discussion
 
 On Thu, Apr 5, 2012 at 1:28 AM, Dan Brickley dan...@danbri.org wrote:
  On 5 April 2012 00:18, Jake Mannix jake.man...@gmail.com wrote:
   +1 to everything Ted said.
   
   As an added point, while we're on the subject of corporate involvement,
   forks, and extensions of Mahout, now is as good a time as any to
   announce that I (and my teammate Andy Schlaikjer) are maintaining a
   official Twitter fork of Mahout (hosted and worked on entirely in
   the open on GitHub: http://github.com/twitter/mahout ), which we'll be
   making patches off of to submit back to Apache trunk on a periodic basis.
  
  [...]
  
  Jake, care to add some appropriate brief mention to
  https://cwiki.apache.org/MAHOUT/powered-by-mahout.html ? Knowing that
  Twitter make serious use of Mahout adds a lot of credibility to the
  project, and I'm sure would be enough additional information to tip
  various others over into more seriously considering adoption. I
  thought about linking your previous email mention of it, ... but you'd
  know better what to say and/or link to.
 
 done!

Awesome - Thanks!

It's always great to see more entries to this list.

Isabel




Re: Commercializing Mahout: the Myrrix recommender platform

2012-04-19 Thread Isabel Drost
On 05.04.2012 Sean Owen wrote:
 On Wed, Apr 4, 2012 at 11:43 PM, Darren Govoni dar...@ontrenet.com wrote:
  The short answer is that they have to open their source. So anything
  they do to the original code is readily available to all.

 Not with the Apache license... it's not copyleft. The GNU license
 might require this.

AFAIK and IANAL: Neither the GNU General Public License nor the GNU Lesser
General Public License requires modifiers to open their source to just anyone.
Only if they hand the resulting binary over to someone else do those
modifications need to be given to that person and made available under the same
original license, in an effort to give the receiver of the binary the same
rights that the modifier's work was initially built on. There is no need to make
these modifications available to the general public - although that might turn
out to be the most pragmatic solution.


Isabel





Re: Error Running mahout-core-0.5-job.jar

2012-03-28 Thread Isabel Drost
On 22.03.2012 Paritosh Ranjan wrote:
 You can also use HadoopUtil.delete(conf, paths) api or use the -ow
 (override) flag ( if available for that job).

If that flag isn't available for the job you are looking at, that might be a
good chance to submit a bug report and mark it as suitable for beginners -
just label it MAHOUT_INTRO_CONTRIBUTE in JIRA.

Isabel




Re: How to add classes into mahout-score-0.5-job.jar?

2012-03-28 Thread Isabel Drost
On 22.03.2012 jeanbabyxu wrote:
 From Chapter 6 of Mahout in Action (page 111)
 
 But were you to use your own implementation, you would need to add it and
 any of its dependent classes into he JAR file as well. This can be
 accomplished with
 
 jar uf mahout-core-0.5-job.jar -C [classes directory]
 
 
 My question is : how to find out the directory for the dependent classes?

This description explains how to add your own classes that you have implemented
to the classpath - e.g. in cases where you want to use your own distance
implementation rather than those provided with Mahout. I'm not sure this is
what you are looking for. What do you want to accomplish?


Isabel




Re: Mahout 0.6 Naive Bayes Accuracy

2012-03-28 Thread Isabel Drost
On 27.03.2012 Dimitri Goldin wrote:
 Having tried Mallets naive bayes implementation we achieved ~95%
 accuracy without having to balance the training-data. Does anybody know
 which implementation detail might cause this or why balance seems
 influence mahouts implementation much more?

Without knowing the Mallet implementation: You describe that you tried using
two tokenizations for your Mahout runs - what are you using when running Mallet?

Which Naive Bayes implementation in Mahout did you use?

Did you also try running with the complementary naive bayes implementation or 
the logistic regression instead?


Isabel




Re: options for finding smallest eigenvectors

2012-03-28 Thread Isabel Drost
On 28.03.2012 Dmitriy Lyubimov wrote:
 Nathan Halko's thesis did detailed comparisons of singular values
 between Mahout's Lanczos and SSVD. You can look up a link to his
 dissertation on this list archive. (or perhaps he mentioned it @dev,
 can't remember on top of my head).

When you find it - could you please add it here:

https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks

Maybe add a separate section for scientific publications involving Mahout.


Isabel




Re: BedCon Talk: Easy Java EE 6 war with Mahout bundled 0.6

2012-03-15 Thread Isabel Drost
On 20.02.2012 Manuel Blechschmidt wrote:
 I am going to give a talk about setting up mahout in a Java EE environment:
 http://bed-con.org/talks/how-to-build-a-recommender-system-based-on-mda-gwt-mahout-and-java-ee/
 
 I created in my eyes the smallest possible demo for a recommender with an
 as easy as possible set up including a small sample of my facebook
 friends.

Sounds like an interesting talk - would you mind sharing the slides afterwards?


Isabel




Re: dataset for recommendations and Hidden markov chains

2012-01-07 Thread Isabel Drost
On 06.01.2012 rahul raghavendhra wrote:
 hi all, Can u suggest me the dataset for hidden markov chains and
 recommendations..

Please check out our wiki (the link on mahout.apache.org is called
"documentation"). There is a list of datasets as well as several pages on how
to start and run the various algorithms in Mahout.


Isabel




Re: Mahout Installation without Building From Maven

2012-01-06 Thread Isabel Drost

Just two minor comments:

On 28.12.2011 Lance Norskog wrote:
 In general you are better off using the full source distribution.

Binaries are provided for convenience only - however you should be fine if you
only want to use the jars and job.jars.


 There are some apps and scripts that can help you and these are not
 packaged into the Maven binary distribution.  Also, 0.5 is an old
 release and there have been a lot of changes since then.

In case you have the cycles to try out trunk - any feedback is highly welcome
as we are in the process of getting a new version released soon.


Isabel




Re: Generating vectors from custom source

2011-12-21 Thread Isabel Drost
On 16.12.2011 Dale McDiarmid wrote:
 Could someone please share some code from a similar
 requirement to get me started - an example reading a csv file for example.

There is also some documentation on the wiki:

https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
https://cwiki.apache.org/MAHOUT/lda-commandline.html

Isabel




Re: Problems reading solr index

2011-12-21 Thread Isabel Drost
On 19.12.2011 Billy Newport wrote:
 Any place to download the current snapshot? I'm firewalled here so I
 can't get Svn access.

Just for reference: https://repository.apache.org/ should have recent snapshots

Isabel




Re: ANN: The Mahout Recommender Plugin 0.5.1 Released

2011-12-21 Thread Isabel Drost
On 20.12.2011 Chee Kin Lim wrote:
 Please see release note at
 http://limcheekin.blogspot.com/2011/12/mahout-recommender-plugin-051-released.html

Thanks for integrating the recommender part of Mahout into Grails. Any feedback
on how to better support integration into 3rd party components is highly
welcome. On a related note: Do you think it might also make sense to integrate
any other parts of Mahout?

Isabel




Re: Weighted Naive Bayes Algorithm

2011-12-21 Thread Isabel Drost
On 20.12.2011 Ramprakash Ramamoorthy wrote:
I am using Naive Bayes classifier for my sentiment analysis on
 customer support. But unfortunately I don't have huge annotated data sets
 in the customer support domain.

If your training set is small - why not use e.g. SGD instead of Naive Bayes?


Isabel




Re: Map/Reduce for mahout SGD Classification

2011-12-21 Thread Isabel Drost
On 21.12.2011 Ted Dunning wrote:
 On Tue, Dec 20, 2011 at 11:06 PM, selva selvai...@gmail.com wrote:
  When will map/reduce release for mahout SGD Classification?
 
 Probably 0.6
 
   When will mahout 0.6 release ?
 
 Q1 of 2012

Valid for both: If you need the functionality faster - Any helping hand even if 
it just involves testing the patch is welcome.

Isabel




Re: Austin SIGKDD - Next Meeting Wednesday, December 14, 2011, 7:00 - 8:00 pm

2011-12-14 Thread Isabel Drost
On 14.12.2011 David Boney wrote:
 Sure, we are studying machine learning using  Mahout.  We have started a
 weekly hackers dojo to learn how to implement Hadoop based machine
 learning programs using Mahout. Once the group get some experience using
 Mahout, we are going to focus on projects to add functionality to Mahout.

While you are still in the new-user-trying-to-figure-stuff-out mode - would be
great if you could point out any documentation that lacks detail - or maybe
even fix it.

Also instead of adding new functionality it would be great if you could also 
concentrate on better integration and streamlining - I guess you are looking at 
various parts of Mahout right now.


Isabel




Re: newbie design question

2011-12-12 Thread Isabel Drost
On 08.12.2011 ajinkya wrote:
 I am struggling in the mountain of tutorials and documentations... need
 some design help.

There are two wiki pages that should help you get started:

https://cwiki.apache.org/confluence/display/MAHOUT/Quickstart (the chapter on 
clustering has some examples)

https://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData (gives 
more detailed instructions)


Isabel




Re: Frequent itemset mining

2011-12-05 Thread Isabel Drost
On 02.12.2011 Tom Pierce wrote:
 These programs are actually exposed though the main mahout program; if you
 run:
 
 $MAHOUT_HOME/bin/mahout fpg
 
 it will run the Frequent Pattern Growth algorithm (aka frequent itemset
 mining).

Also there is quite some documentation on the wiki:

https://cwiki.apache.org/MAHOUT/parallel-frequent-pattern-mining.html (also 
includes a link to the original research publication).

Isabel





Re: LDA clustering example not working

2011-12-05 Thread Isabel Drost
On 02.12.2011 Chris Grier wrote:
 Caused by: java.io.IOException: Cannot open filename
 /tmp/mahout-work-hadoop/reuters-out-seqdir-sparse-lda/tf-vectors/_logs

Are you providing the correct input directory here? At first sight it seems to
think that the logs dir contains the tf-vectors.

On a related note: If you are working with LDA - did you try out Jake's new 
implementation? Would be great to get more feedback on that one.

Isabel




Re: Grant's developerworks article a weekly highlight

2011-12-05 Thread Isabel Drost
On 02.12.2011 Ted Dunning wrote:
 We knew it was a highlight, but IBM seems to agree!
 
 http://www.ibm.com/developerworks/podcast/twodw-110911/

Grant: Congratulations!

Isabel




Re: Relevance score - Classification

2011-11-30 Thread Isabel Drost
On 29.11.2011 Faizan(Aroha) wrote:
 In our case, I think we won't be looking much into features
 
 I am moving towards clustering as Tantons's mentioned.

Hmm - what kind of similarity measure are you planning to use for that? What
makes two items similar in your use case?

Isabel




Re: LDATopic

2011-11-30 Thread Isabel Drost
On 28.11.2011 bish maten wrote:
 mahout ldatopics -i mahout-work/abc/abc-lda/state-20  -d
 mahout-work/abc/abc-out-seqdir-sparse-lda/dictionary.file-0  -dt
 sequencefile  (there were no errors reported and command worked fine with
 following output). Does the output appear ok?

Hmm - this only prints the resulting LDA topics - which command did you use to 
generate them?

Please also note that Jake is currently working on improving our LDA support.
If you are interested in that algorithm it might be worth looking into his
patch in https://issues.apache.org/jira/browse/MAHOUT-897

Isabel




Re: Mahout distribution download

2011-11-30 Thread Isabel Drost
On 28.11.2011 Sean Owen wrote:
 There is no newer distribution, but, you can always check out the very
 latest from Subversion:
 https://cwiki.apache.org/confluence/display/MAHOUT/Version+Control

Also we do publish nightly builds at the Apache Maven-Snapshot repository.

If you would like to help shorten the time it takes until the next release 
please check any open issues tagged as 0.6:

 https://issues.apache.org/jira/browse/MAHOUT/fixforversion/12316364

Isabel




Re: Data class taxonomy for machine learning

2011-11-30 Thread Isabel Drost
On 29.11.2011 Ted Dunning wrote:
 I find this taxonomy excessive and over-done.  The distinctions I find
 useful include
 
 - continuous variables
 
 - discrete variables with a known set of values (I call these categorical,
 usually).  This includes ordinal variables since ordering rarely makes a
 lot of difference.
 
 - discrete variables with a large or not well known set of possible values
 (I call these word-like)
 
 - bags or lists of word-like variables (I call these text-like)

What I found useful for explaining which data types to expect:

http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf
(Slide 6, unfortunately in German only)

What seemed more needed was an explanation of different problem settings and
how to tackle them on a very high level:
http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Problemanalyse.pdf


Isabel




Re: Clustering - Sequence File from Directory

2011-11-30 Thread Isabel Drost
On 30.11.2011 Faizan(Aroha) wrote:
 Would anyone please give any hint?
 
 On Running the following command:
 
 bin/mahout seqdirectory -c UTF-8
 
 -i examples/reuters-extracted/ -o reuters-seqfiles
 
 I'm getting the following error:
 
 
 
 MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
 MAHOUT_LOCAL is set, running locally

That means the job will run locally only, don't expect any jobs to appear in 
your Hadoop jobtracker.


 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/hadoop/util/ProgramDriver
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.util.ProgramDriver

Did you build all of mahout before running the command line tool?

Isabel




Re: org.apache.maven.plugins:maven-antrun-plugin:1.6:run grief, copy-dependencies and unpack goals not supported by m2e, importing mahout into Eclipse

2011-11-27 Thread Isabel Drost
On 27.11.2011 Mike Spreitzer wrote:
 I presume you are speaking to the issue of the goals not supported.  Note
 that I also have another problem, No marketplace entries found to handle
 maven-antrun-plugin:1.6:run in Eclipse.  Any clues about that one?

To me that sounds like an Eclipse problem (or a problem with the maven 
integration of Eclipse).

Isabel




Re: Scalable graph clustering implementation

2011-11-27 Thread Isabel Drost

First of all, your findings sound very interesting - thanks for sharing.


On 26.11.2011 Bae, Jae Hyeon wrote: 
 I want to contribute my implementation to Mahout if it is available and
 allowed. Please let me know how I can follow up.

The easiest starting point would be our how to contribute wiki page:

https://cwiki.apache.org/MAHOUT/how-to-contribute.html

Please keep in mind that contributing whole new packages and algorithms may be
a bit more involved: Apart from the implementation itself, including unit
tests, there is also a need for a running example that shows how to use your
code and at least some sort of quickstart documentation in our wiki.


Isabel




Re: Sequential Pattern Mining

2011-11-27 Thread Isabel Drost
On 27.11.2011 Nishant Chandra wrote:
 I want to identify rules such as: after acquiring product 1 and then
 product 3, customers have an increased likelihood
 (75%) of purchasing product 4 next.

What is your goal with discovering these rules? Assuming what you want is 
implementing a feature that recommends items to customers they are likely to 
buy:

Did you check the fpgrowth implementation already? Though it does not cover the
temporal aspect you mention it might still be of value for you as it is capable
of discovering items that are typically purchased together.
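
To illustrate what "purchased together" means in terms of data, here is a
small sketch that simply counts how often pairs of items share a basket. This
is brute-force pair counting, not the FP-growth algorithm and not Mahout code;
the baskets and the min_support threshold are made-up assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return the item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each unordered pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical purchase histories, one list per customer basket.
baskets = [
    ["product1", "product3", "product4"],
    ["product1", "product3"],
    ["product3", "product4"],
    ["product2", "product4"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Pair counting like this grows quadratically with basket size, which is one
reason to prefer a dedicated frequent pattern mining implementation on real
data.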

If you would rather personalize your offerings to the preferences of each of
your customers you might be better off taking a closer look at the collaborative
filtering implementations of Mahout.


Isabel




Re: ItemSimilarity example

2011-11-27 Thread Isabel Drost
On 27.11.2011 bish maten wrote:
 https://cwiki.apache.org/MAHOUT/recommender-documentation.html   has
 following example
 
 // Construct the list of pre-computed correlations
 Collection GenericItemSimilarity.ItemItemSimilarity  correlations = ...;
 
 how is actual construction done in above line  ( correlations = ... ; )

If I understand the abstract above that line of code correctly these item
similarities should capture your personal domain knowledge about items. So how
to compute them is up to your definition of what makes two items similar.
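
As one possible example of such a definition, the sketch below derives
pairwise similarities from descriptive tags using the Jaccard coefficient.
The tags, the item ids and the choice of Jaccard are assumptions for
illustration only; each resulting triple holds the kind of information (two
item ids and a value) that one pre-computed similarity entry would carry.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets: shared elements over all elements."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical domain knowledge: item id -> set of descriptive tags.
item_tags = {
    1: {"rock", "guitar", "live"},
    2: {"rock", "guitar"},
    3: {"jazz", "piano"},
}

# One (item id, item id, similarity) triple per unordered pair of items.
correlations = [
    (i, j, jaccard(item_tags[i], item_tags[j]))
    for i, j in combinations(sorted(item_tags), 2)
]
```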


Isabel




Re: mahout command problems

2011-11-27 Thread Isabel Drost
On 27.11.2011 bish maten wrote:
 mvn compile done under subdirectory of mahout-distribution.

Did you also run a mvn package from the mahout root directory?

Isabel




Re: Load Dataset and Instances from database

2011-11-25 Thread Isabel Drost
On 24.11.2011 Ted Dunning wrote:
 Actually, one of the most reliable ways to kill a database is to use it as
 input or output for even a small Hadoop cluster.  Having hundreds of
 processes all open connections and read at once is fairly abusive.

Though that does not mean that data cannot be synced to hdfs before being used
in a map/reduce job. Tools like sqoop help with that.
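
The idea behind such a sync is a single sequential export instead of hundreds
of concurrent connections. The sketch below shows that idea with an in-memory
SQLite database standing in for the real one; it is not how sqoop works
internally, and the table, column names and export_table helper are
hypothetical.

```python
import csv
import io
import sqlite3

def export_table(connection, query, out):
    """Write the result of one query to `out` as CSV and return the row count."""
    cursor = connection.execute(query)
    writer = csv.writer(out)
    # Header row taken from the column names of the result set.
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count

# Hypothetical stand-in for the production database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ratings (user_id INTEGER, item_id INTEGER, rating REAL)")
db.executemany("INSERT INTO ratings VALUES (?, ?, ?)",
               [(1, 10, 4.0), (1, 11, 2.5), (2, 10, 5.0)])

# In practice this would be a file that is afterwards copied to HDFS.
dump = io.StringIO()
exported = export_table(
    db, "SELECT user_id, item_id, rating FROM ratings ORDER BY user_id, item_id", dump)
```

The map/reduce job then reads only the exported file, so the database sees
one reader instead of one per task.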

Isabel




Re: Load Dataset and Instances from database

2011-11-25 Thread Isabel Drost
On 24.11.2011 Sturm, Martin wrote:
 Since I only want to try it out standalone I was hoping that this was
 possible without any Hadoop stuff. Are there any tutorials or examples
 available that show how to load a Dataset? Because I do not even know what
 files are expected here.. cvs?

You may want to take a look at our quickstart wiki page for that. It explains
the two examples that show how decision forests can be used:

https://cwiki.apache.org/MAHOUT/breiman-example.html
https://cwiki.apache.org/MAHOUT/partial-implementation.html

Isabel




Re: Relevance score - Classification

2011-11-23 Thread Isabel Drost
On 23.11.2011 Faizan(Aroha) wrote:
 We are working on using Classification as a Search.
 
 I want to compute the relevance score of the output which is generated by
 the Naive Bayes Classifier or some other classifier.
 
 Please give any guideline/hint!

Can you please provide some more background to your use case? Which documents
do you want to search? How is relevance defined in your setting?

Isabel




Re: Error in executing mahout kmeans

2011-11-22 Thread Isabel Drost
On 22.11.2011 DIPESH KUMAR SINGH wrote:
 I ran the script and i was getting error regarding missing libraries. The
 error which i got is attached.
 Then i tried executing the commands in the script, command by command, and
 i figured out that error was coming
 in the seq2sparse step. (Prior to this step all the conversions are working
 fine)

There seem to be problems resolving some of the dependencies used - not sure
why though. You did compile the project and in that process created a job jar?


 What i exactly want to try is document clustering, i thought it is better
 to try first with Reuters dataset to get started.
 Are the source files of kmeans (mapper and reducer etc) are there in mahout
 source folder?

Sure, look in the maven module core in the o.a.m.clustering package - all
kmeans related code is in there.

Isabel




Re: Which input formats to use for classifying WEKA's ARFF format?

2011-11-22 Thread Isabel Drost
On 22.11.2011 HorstItUpright wrote:
 As far as I know, Mahout provides two Bayes algorithms and a Random
 Forest (which is - whyever - called Dicision Forest [which is not
 wrong, I know, but confusing and inconsistent to the Docs I think]).

+ logistic regression (to be found in the sgd package)


 It appears to me (and I've also taken a look into the code) that none
 of these approaches can handle the MVC format (which is the result,
 when parsing the WEKA-ARFF files with the arff-vector converter).

I am not too familiar with the MVC format - is that an intermediate file format
used by WEKA after parsing ARFF?

 The DF is even more special and requires the UCI format.

DF?


 My question now is: am I overseeing something? Is there a way to
 convert the MVC files on the fly into the proper formats for the
 algorithms?

All algorithms in Mahout are implemented to accept vectors as input format. So
in order to plug in whatever input format (or database, NoSQL store, whichever
other source for data you might have) all you have to do is provide glue code
that converts your data into Mahout vectors.
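
What such glue code does can be sketched independently of Mahout: read the
records, copy numeric attributes and expand nominal ones into 0/1 slots. The
example below does this for a small CSV sample; the column names, the sample
data and the rows_to_vectors helper are assumptions, and plain Python lists
stand in for Mahout's vector classes.

```python
import csv
import io

def rows_to_vectors(csv_text, numeric_columns, categorical_column):
    """Turn CSV rows into dense numeric vectors.

    Numeric columns are copied as floats; the categorical column is
    one-hot encoded over the sorted set of values seen in the data.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    categories = sorted({row[categorical_column] for row in rows})
    vectors = []
    for row in rows:
        vector = [float(row[name]) for name in numeric_columns]
        vector += [1.0 if row[categorical_column] == c else 0.0 for c in categories]
        vectors.append(vector)
    return categories, vectors

# Hypothetical input with two numeric attributes and one nominal attribute.
sample = """height,weight,colour
1.5,10,red
2.0,12,blue
1.0,8,red
"""
categories, vectors = rows_to_vectors(sample, ["height", "weight"], "colour")
```

For real data the set of nominal values would normally come from the file
header (as in ARFF) rather than from scanning the rows.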

Having said that there is limited support for ARFF in Mahout already. To my
knowledge that is not feature complete - any help with spotting missing
features and fixing them is highly welcome.


 The Bayes algorithms e.g. are running with the input data, but print a
 lot of strange output to the console during processing and do not give
 any usable results.

Any help with improving logging to make the project easier to use is very 
welcome. Would be great if you could put up a JIRA issue and attach a patch to 
change the code to better match your expectations to get that discussion 
started.


Cheers,
Isabel




Re: Wiki edit request

2011-11-20 Thread Isabel Drost
On 19.11.2011 Lance Norskog wrote:
 Fixed. A: it moved, and B: it's Jenkins now.
 
 On Fri, Nov 18, 2011 at 6:02 PM, Dan Beaulieu
 
 danjacob.beaul...@gmail.comwrote:
  While on the topic, the hudson url is broken... Don't know what it should
  be...

Dan - good catch. Lance, thanks for the fix.

Isabel




Re: New User to Mahout

2011-11-18 Thread Isabel Drost
On 12.11.2011 thinkingbigdata wrote:
 I want to understand it fully and want coding to be done in Java. If anyone
 can help me with some examples code that is using Hadoop written examples
 that would be really helpful.

Do you have any machine learning problem you want to get started with in 
particular? Knowing what in particular you are interested in would make it 
easier to answer your question.


Isabel




Re: Coding format update: Eclipse Lucene conventions

2011-11-18 Thread Isabel Drost
On 14.11.2011 Lance Norskog wrote:
 The Eclipse Lucene conventions are mighty close to what we're using, much
 more so that the Eclipse formatting file on the How To Contribute page.
 So, I've uploaded the Lucene file and changed the link. Eclipse users,
 please try it and see if it's what we want.

Thanks for that contribution.


Isabel




Re: mahout for enterprise search project

2011-11-18 Thread Isabel Drost
On 15.11.2011 Burcu Buyukkagnici wrote:
 Where does mahout; Lucene/solr and UIMA framework fit in the following
 scenario?

For some more background on how search and machine learning fit together see
also http://www.manning.com/ingersoll/

Also at the latest ApacheConNA Grant provided some ideas and insights on what 
types of problems can be solved by a search engine alone. Recordings of all 
talks are online at http://feathercast.org


Isabel




Re: Documentation

2011-11-18 Thread Isabel Drost
On 16.11.2011 Ted Dunning wrote:
 One thing that you can do is to point out the problems and even suggest or
 provide some improvements.  Your eyes are still new and thus will see
 problems more clearly than ours.

One thing to note: Most of the Mahout documentation is online in our wiki -
that wiki essentially is public, so if you do have some time left and spot an
area that you think needs improvement, please do not hesitate to add
information.

Also if you spot missing JavaDocs: Providing them is a very simple way to get 
your first patches in.


Isabel




Re: Austin Hacker Dojo - Big Data Machine Learning

2011-11-18 Thread Isabel Drost
On 17.11.2011 David Boney wrote:
 If at least
 three or four people are interested we can have an organization meeting to
 discuss the group name, finding a location to meet, development
 environment, setting up a web site, and the agenda for the first couple of
 months.

Just a brief comment: Don't know how much interest in Big Data Machine Learning
there is in Austin - however what did work for most meetings I started in
Berlin was to have a more informal gathering at first to figure out how many
people would be interested - and later on decide on web site, agenda etc.

Isabel




Re: Large Scale Clustering

2011-11-18 Thread Isabel Drost
On 18.11.2011 Grant Ingersoll wrote:
 Might be of interest: Clustering Very Large Multi-dimensional Datasets
 with MapReduce
 
 http://www.cs.cmu.edu/~jclopez/ref/kdd2011-mr-clustering.pdf

Judging from the abstract it looks interesting indeed. Thanks for sharing, 
Grant.


Isabel




Re: Relevance Prediction Challenge / WSDM 2012 Web Search Click Data Workshop

2011-11-07 Thread Isabel Drost
On 07.11.2011 Pavel Serdyukov wrote:
 We are pleased to announce the launch of the Relevance Prediction
 Challenge, which is a part of the WSDM 2012 Web Search Click Data (WSCD)
 workshop. This challenge provides a unique opportunity to consolidate
 and scrutinize the work from industrial labs on predicting the relevance
 of URLs using user search behavior. It provides a fully anonymized
 dataset shared by Yandex, which has user queries, clicks on URLs and
 their relevance labels.

Any of our Mahout users interested in taking up that challenge? Might be a nice 
project also for people in the academic world working on relevance models based 
on user feedback.


Isabel




Re: does anyone use the row label bindings stuff in Vector / Matrix?

2011-11-02 Thread Isabel Drost
On 02.11.2011 Jake Mannix wrote:
 I'll leave this thread open until after work tonight (8 hrs or so from
 now), and if I don't hear any vociferous complaints or reasoned thoughts on
 why this is crazy, I'll chop 'em.

+1 for the cleanup, however if you are leaving the thread open for that
purpose, you might want to at least wait a day until people in all time zones
had a chance to read it.

Isabel




Re: Production use cases of Mahout

2011-11-01 Thread Isabel Drost
On 01.11.2011 Josh Patterson wrote:
 There's a few, check out: http://www.hadoopworld.com/agenda/
 
 The bit.ly folks always have something interesting to show.
 
 The WibiData guys are doing some interesting things with their product
 and recommendation.

Any chance that slides/videos of the talks are going to be made public after
the event? Would love to link to them from the Mahout wiki.

Isabel




Re: Exception in thread main org.apache.lucene.index.CorruptIndexException: unrecognized format -3 in file _b.fnm

2011-10-24 Thread Isabel Drost
On 20.10.2011 OldSkoolMark wrote:
 Exception in thread “main” org.apache.lucene.index.CorruptIndexException:
 unrecognized format -3 in file “_b.fnm”

I do not have much experience with Lucene, but this looks like you are trying
to read the index with a Lucene version that is older than the one the index
was created with?


Isabel




Re: Looking for someone with experience integrating content-based approaches in Mahout

2011-10-24 Thread Isabel Drost
On 20.10.2011 Ted Dunning wrote:
 THere is also the j...@apache.org mailing list which is less focussed but
 might hit some folks with the right expertise that this list does not.

And there is https://cwiki.apache.org/MAHOUT/professional-support.html which 
lists companies and people that declared themselves as willing and capable of 
helping Mahout users.

Isabel




Re: Request for Assistance.

2011-10-05 Thread Isabel Drost

First of all welcome also from my side.

On 05.10.2011 Apurv Verma wrote:
  I am interested in becoming a contributor to Mahout.

Actually we have a How to contribute page on our wiki that might help you:

https://cwiki.apache.org/MAHOUT/how-to-contribute.html

I guess the general takeaway is to start using Mahout for your own projects.
As with any software you use, sooner or later you will find stuff that bothers
you: missing documentation, extensions you need to make here and there, subtle
bugs.


  But unfortunately I have not had any course in Machine Learning still. I am
  having a course in Artificial Intelligence this semester.

While it is certainly a great help to have some machine learning background,
you do not need a PhD to start contributing to Mahout. Any infrastructure
improvements that do not change the inner algorithms but make it easier to
integrate Mahout and re-use it are highly welcome.


 I am also *not* conversant with hadoop and mapreduce though I have heard of
 it and have long wanted to learn it. Can someone please guide me (mentor
 informally) so that I may get a sense and direction and I am able to
 develop the skills set required to contribute to this project within the
 next 6 months.

You have taken a very good first step by contacting the mailing list. Try to
figure out an area that you would like to use Mahout for and start working in
that direction. If you come across any questions that cannot be answered by a
trivial search in the mailing list archives, don't be shy to ask on list. As
you get more proficient, answer questions other newcomers may have, start
reviewing patches and maybe even contribute your own improvements.

Isabel




Re: Applying DataMining on Network Packets

2011-10-04 Thread Isabel Drost
On 04.10.2011 Sarath P R wrote:
 I am monitoring packet flow in a Network Interface .  Now i want to make
 some predictions.

What kind of prediction do you want to make?


 Actually i am not sure about what algorithm i should use
 and what kind of predictions i may need. I just want to know is it possible
 to classify network packets using Mahout Classification algorithm. Can
 anyone make some comments.

The classification algorithms of Mahout are based on the idea of classifying 
items that have to be represented as multidimensional vectors and as a result 
are not bound to be used for just one domain.

Put more simply: First think of what kinds of predictions you want to make.
Then think of features that contain information on which prediction is more
likely. Code these features as vectors and continue from there.
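
As a rough sketch of that last step, the snippet below encodes one packet
record as a fixed-length vector using feature hashing, a common trick for
mapping named features to vector slots. The field names (protocol, dst_port,
length), the vector size and the scaling constant are all assumptions for
illustration, and this is plain Python rather than Mahout's own encoders.

```python
import hashlib

def hashed_index(name, size):
    """Map a feature name to a stable slot in the range [0, size)."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % size

def packet_to_vector(packet, size=16):
    """Encode one packet record as a fixed-length numeric vector."""
    vector = [0.0] * size
    # Categorical fields: add 1.0 to the slot of each "field=value" string.
    for field in ("protocol", "dst_port"):
        vector[hashed_index(f"{field}={packet[field]}", size)] += 1.0
    # Numeric field: add the scaled value to the slot of the field name.
    vector[hashed_index("length", size)] += packet["length"] / 1500.0
    return vector

# Hypothetical packet record as it might come out of a capture tool.
packet = {"protocol": "tcp", "dst_port": 443, "length": 750}
vector = packet_to_vector(packet)
```

Different features may collide in the same slot; a larger vector size makes
that less likely at the cost of more memory.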

A really nice explanation of this concept is given in the Mahout in Action
book. You can also take a quick look at the following slides for a general
outline:


http://www.user.tu-berlin.de/konrad.rieck/pubs.html




Re: Mahout testimonials

2011-10-03 Thread Isabel Drost
On 29.09.2011 Dan Brickley wrote:
 For what it's worth, we used Mahout in the NoTube EU project, and it
 saved a lot of time (and a brain transplant).  I should blog this. The
 only piece we've used heavily in our apps
 (http://vimeo.com/user3487770 http://notube.tv/ ) [...]
 
 One nice thing about this community, is that Mahout is not
 over-marketed. If the nature or scale of your problem better suits
 other tools, the Mahout folk will tell you so.

Thanks for the really nice comment. I've added you to our powered-by wiki
page - feel free to add any additional content as you see fit.

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

Isabel




Re: Bayes/CBayes classification on a non-existing feature

2011-10-03 Thread Isabel Drost
On 29.09.2011 André-Philippe Paquet wrote:
 After checking in the CBayesAlgorithm class, I made my own subclass and
 overrided the featureWeight function to return 0 if the weight of the
 feature in the curent label is 0 instead of returning the theta normalized
 weight. It fixed the problem in my case.

 Should I fill an issue?

Yes, absolutely. Your fix sounds like a nice starting point. 

Robin, in a second iteration, should we allow users to plug in their own
strategies for weighting so far unseen features, or can we come up with one
that works for all of the most common cases?

Isabel




Fwd: ApacheCon Vancouver Meetups, and other chances for your project to get involved

2011-10-03 Thread Isabel Drost
On 28.09.2011 Nick Burch wrote:
 If you're interested in hosting a Meetup, please list the idea on the 
 Meetups wiki[2]:
  http://wiki.apache.org/apachecon/ApacheMeetupsNa11
  If you see one there you like the look of, bump up the interested count. 
 Once we know there's enough takers, we'll schedule the meetup and help get 
 the word out!
 Also, if you think that a company in your project area might be willing to 
 buy some beer for your meetup, please ask them to drop Delia 
 deliafr...@gmail.com an email and she'll help them get that sorted :)

Any Mahout people (in addition to Grant Ingersoll, Shannon Quinn and myself) 
planning to attend Apache Con NA? 



 In terms of other chances to get together or spread the word about your 
 project, there are a few other options. We're still seeking speakers for 
 the Fast Feather Track, which hosts 20 minute talks about new projects, 
 ideas and features. If there's something new in your area, sign up and let 
 everyone know about it! Signup is here[3]:
https://docs.google.com/spreadsheet/viewform?hl=en_GBformkey=dDR5ZEN0amFzZGVGdHVnQWpuSWM0bGc6MQ#gid=0

If you are a happy Mahout user and are planning to attend Apache Con - why not 
put in a short presentation on your Mahout use case? I'd love to learn more on 
what people are working on.

Isabel




32 Days left to Berlin Buzzwords 2011

2011-05-05 Thread Isabel Drost
hey folks,

BerlinBuzzwords 2011 is close: only 32 days are left until the big Search,
Store and Scale open source crowd gathers in Berlin on June 6th/7th.

The conference again focuses on the topics search, data analysis and NoSQL.
It is to take place on June 6th/7th 2011 in Berlin.

We are looking forward to two awesome keynote speakers who shaped the world of
open source data analysis: Doug Cutting (founder of Apache Lucene and Hadoop)
as well as Ted Dunning (Chief Application Architect at MapR Technologies and
active developer at Apache Hadoop and Mahout).

We are amazed by the amount and quality of the talk submissions we got. As a
result this year we have added one more track to the main conference. If you
haven't done so already, make sure to book your ticket now - early bird tickets
have been sold out since April 7th and there might not be many tickets left.

As we would like to give visitors of our main conference a reason to stay in
town for the whole week, we have been talking to local co-working spaces and
companies asking them for free space and WiFi to host Hackathons right after the
main conference - that is on June 8th through 10th.

If you would like to gather with fellow developers and users of your project,
fix bugs together, hack on new features or give users a hands-on introduction to
your tools, please submit your workshop proposal to our wiki:

http://berlinbuzzwords.de/node/428

Please note that slots are assigned on a first come, first served basis. We are
doing our best to get you connected, however space is limited.

The deal is simple: We get you in touch with a conference room provider. Your
event gets promoted in our schedule. Co-Ordination however is completely up to
you: Make sure to provide an interesting abstract, provide a Hackathon
registration area - see the Barcamp page for a good example:

http://berlinbuzzwords.de/wiki/barcamp

Attending Hackathons requires a Berlin Buzzwords ticket and (then free)
registration at the Hackathon in question.

Hope I see you all around in Berlin,


Isabel




Re: Recommended reading

2011-03-31 Thread Isabel Drost
On Mon, 28 Mar 11 Dan Brickley wrote:
 I've collected up much of the text from this mail thread and added it
 to the wiki at
 https://cwiki.apache.org/confluence/display/MAHOUT/Reference+Reading
 
 I've added links where I could find them, wikified the voice a little
 (downplayed opinions and some detail), but otherwise the text is from
 this thread.
 
 The page currently is a bit awkward since I appended a large body of
 text to a pre-existing small entry, but it seemed better than adding
 a new page. But I'd rather circulate it as-is now than leave this on
 a 'someday pile', so ... there you go. Hope it's useful and that
 others are in the mood to jump in and polish / improve the page.

Thanks so much for going to the effort of adding this information to
the wiki page - sure it needs some polish, however it's nice to see
some of the wisdom commonly found on the mailing list transferred over
to the wiki.

Isabel


BerlinBuzzwords 2011 Early Bird Ticket Period ends on April 7th.

2011-03-25 Thread Isabel Drost

Hey folks,

just a short notice for those who haven't noticed: we have only a
limited number of Early-Bird tickets left and the Early-Bird period
ends on April 7th. If you want to get one of the 30 remaining tickets
go and get one now here: http://berlinbuzzwords.de/content/tickets

Since we are still working on the schedule and selecting speakers, we
have not sent out any rejection mails yet. So if you have submitted a talk
for BerlinBuzzwords 2011 you don't need to get an Early-Bird ticket
now. All potential speakers will be eligible for the Early-Bird discount
even after April 7th.


regards,

Isabel




Re: Automatically extracted Mahout FAQs

2011-02-23 Thread Isabel Drost
On Wed, 23 Feb 11 Sean Owen wrote:

 Nice, very interesting to see and read!

Very interesting indeed. Wondering whether a Top 10 of the
most frequently asked questions could be created that way as well.

Isabel


Re: Apache Mahout Hackathon - Berlin - Feb 2011

2011-02-18 Thread Isabel Drost
On Tue, 14 Dec 10 Isabel Drost wrote:
 early 2011 - on February 19th/20th to be more precise - the first
 Apache Mahout Hackathon is scheduled to take place at c-base in
 Berlin.

Just a brief reminder - that is this weekend.

We are going to start with a brief barcamp-like brainstorming session to
find out what people actually want to work on during the course of the
weekend.

After that participants are welcome to join break-out sessions or work
on their own projects. Please don't forget to bring your own ideas.

Please remember to bring your own equipment. There is a bar, so no need
to bring drinks. There are several restaurants nearby, so don't worry
about not getting anything to eat ;) http://tinyurl.com/6d9lc9z


Isabel


Re: Two learning competitions that might be of interest for Mahout

2011-02-15 Thread Isabel Drost
On Fri, 11 Feb 11 Markus Weimer wrote:
 go for it! I'd do it myself but the rules we wrote prohibit me from
 doing so ;-)

I am pretty sure these rules only forbid you entering and trying to
win the competition - can't imagine that you are forbidden to run
Mahout against the competition data, and maybe publish the results
after the contest is over ;)

Isabel



Two learning competitions that might be of interest for Mahout

2011-02-11 Thread Isabel Drost

http://www.kdd.org/kdd2011/kddcup.shtml
 KDD-Cup 2011: Recommending Music Items based on the Yahoo! Music
 Dataset We challenge participants to identify user tastes in music by
 analyzing real ratings of Yahoo! Music anonymized users. The dataset
 represents a snapshot of the community's preferences for various
 musical items.

http://www.heritagehealthprize.com/competition.php
 The goal of the prize is to develop a predictive algorithm that can
 identify patients who will be admitted to the hospital within the
 next year, using historical claims data.

Isabel


CFP - Berlin Buzzwords 2011 - Search, Store, Scale

2011-01-25 Thread Isabel Drost
This is to announce Berlin Buzzwords 2011, the second edition of the
successful conference on scalable and open search, data processing and data
storage in Germany, taking place in Berlin.

Call for Presentations Berlin Buzzwords
   http://berlinbuzzwords.de
  Berlin Buzzwords 2011 - Search, Store, Scale
6/7 June 2011

The event will comprise presentations on scalable data processing. We invite
you to submit talks on the topics:

   * IR / Search - Lucene, Solr, katta or comparable solutions
   * NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
   * Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives
   * Closely related topics not explicitly listed above are welcome. We are
 looking for presentations on the implementation of the systems themselves,
 real world applications and case studies.

Important Dates (all dates in GMT +2)
   * Submission deadline: March 1st, 2011, 23:59 CET
   * Notification of accepted speakers: March 22nd, 2011, CET
   * Publication of final schedule: April 5th, 2011
   * Conference: June 6/7, 2011

High quality, technical submissions are called for, ranging from principles to 
practice. We are looking for real world use cases, background on the 
architecture of specific projects and a deep dive into architectures built on 
top of e.g. Hadoop clusters.

Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no
later than March 1st, 2011. Acceptance notifications will be sent out soon
after the submission deadline. Please include your name, bio and email, the
title of the talk, and a brief abstract in English. Please indicate whether
you want to give a lightning (10min), short (20min) or long (40min)
presentation and indicate the level of experience with the topic your audience
should have (e.g. whether your talk will be suitable for newbies or is targeted
at experienced users). If you'd like to pitch your brand new product in your
talk, please let us know as well - there will be extra space for presenting new
ideas, awesome products and great new projects.

The presentation format is short. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to
provide videos after the event, free drinks for attendees as well as an
after-show party), please contact us.

Follow @hadoopberlin on Twitter for updates. Tickets, news on the conference,
and the final schedule will be published at http://berlinbuzzwords.de.

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Please re-distribute this CfP to people who might be interested.

If you are local and wish to meet us earlier, please note that this Thursday 
evening there will be an Apache Hadoop Get Together (videos kindly sponsored by 
Cloudera, venue kindly provided for free by Zanox) featuring talks on Apache 
Hadoop in production as well as news on current Apache Lucene developments.

Contact us at:

newthinking communications GmbH
Schönhauser Allee 6/7
10119 Berlin
Germany

Julia Gemählich
Isabel Drost 

+49(0)30-9210 596


Re: Interested Mahout developers in the UK (or Europe?)

2011-01-12 Thread Isabel Drost
On Tue, 11 Jan 2011 Sean Owen sro...@gmail.com wrote:

 If that describes you, you can respond to me privately and I'll make
 sure to make the connection when I see some interesting stuff going on
 here.

Same here for Germany (or Europe)

Please also consider adding yourself to our Professional Support wiki
page
https://cwiki.apache.org/confluence/display/MAHOUT/Professional+Support

Isabel


Re: Adding user classes to Mahout's MR jobs.

2011-01-12 Thread Isabel Drost
On Tue, 11 Jan 2011 Dmitriy Lyubimov dlie...@gmail.com wrote:
 It's probably a little bit more of a Hadoop question though but as
 far as i know that's not as easy as specifying additional jars for
 java -cp option, is it?

When using the mahout shell script it should be as easy as defining a
CLASSPATH variable that contains these classes. The script should pick
up this variable and extend it with all dependencies Mahout itself needs.

Similar setups are available when running on a Hadoop cluster.

Isabel


Re: Seq2Sparse and Collocation

2010-12-16 Thread Isabel Drost
On Fri, 10 Dec 2010 Sreejith S srssreej...@gmail.com wrote:
 I have a text file and I converted it into a sequence file. Then I
 created sparse vectors using seq2sparse. Now I would like to get all
 the collocations generated.
 Please tell me how to execute CollocDriver from the command prompt.

There is a description in our wiki:

https://cwiki.apache.org/confluence/display/MAHOUT/Collocations

In addition any driver in Mahout supports the --help option to print
details on command line options.

Isabel


Apache Mahout Hackathon - Berlin - Feb 2011

2010-12-14 Thread Isabel Drost

Hello,

early 2011 - on February 19th/20th to be more precise - the first Apache Mahout
Hackathon is scheduled to take place at c-base in Berlin. The Hackathon will
take one weekend. There will be plenty of time to hack on your favourite Mahout
issue, to get in touch with local Mahout committers, and to get your machine
learning project off the ground. The venue features a bar that sells drinks
(including Club Mate) so no need to bring those.

Please register at https://www.xing.com/events/apache-mahout-hackathon-647603
if you are planning to attend this event so we can plan for enough space for
everyone. If you have not registered for the event there is no guarantee you
will be admitted.

If you'd like to support the event: We'd love to provide pizza and drinks for 
free. If you are interested in sponsoring, please contact me at 
isa...@apache.org

A special Thank You to c-base for providing the location free of charge.

Feel free to forward this information to anyone who might be interested, tweet 
the event, include information on your blog if you are attending. Check the 
above link to learn of potential changes.

Looking forward to a fun and productive weekend,
Isabel


DataDevRoom at the 2011 edition of the FOSDEM

2010-12-07 Thread Isabel Drost
Hello,

We (Olivier, Nicolas and I) are organizing a Data Analytics DevRoom
that will take place during the next edition of the FOSDEM in Brussels
on Feb. 5. Here is the CFP:

  http://datadevroom.couch.it/CFP

You might be interested in attending the event and taking the
opportunity to speak about your projects.

Important Dates (all dates in GMT +2):

Submission deadline:  2010-12-17
Notification of accepted speakers: 2010-12-20
Publication of final schedule:  2011-01-10
Meetup: 2011-02-05

The event will comprise presentations on scalable data processing. We
invite you to submit talks on the topics: Information retrieval / Search,
Large Scale data processing, Machine Learning, Text Mining, Computer
vision, Linked Open Data.

High quality, technical submissions are called for, ranging from
principles to practice. We are looking for presentations on the
implementation of the systems themselves, real world applications and
case studies.

Submissions should be based on free software solutions.

Looking forward to meeting you face to face in Brussels,
Isabel


Re: Checkouts and branches

2010-12-05 Thread Isabel Drost
On 05.12.2010 Lance Norskog wrote:
 Where is the branch/tag for 0.4?

In Mahout's repository at /tags/mahout-0.4 - see also 
http://svn.apache.org/viewvc/mahout/tags/mahout-0.4/

Isabel


Re: Bayes Question.

2010-11-25 Thread Isabel Drost
On Thu, 25 Nov 2010 JAGANADH G jagana...@gmail.com wrote:
  Or is it enough to train with either of good or bad.?
 
 It will be something like train a person to identify 'sweet' by giving
 'salt' as sample

There are some domains where it may make sense to formulate a task as
a one-class classification problem. E.g. looking at time series data one
might want to train a model to identify normal behaviour from
positive data only.

Though it is possible to come up with algorithms for this so-called
one-class classification problem*, I am not aware of any implementation
in Mahout.


Isabel

* For instance see One-Class SVMs for Document Classification by
  Larry M. Manevitz and Malik Yousef for some references and comparison.
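
To make the idea of learning from positive data only a bit more concrete,
here is a minimal sketch in plain Java. It is not a Mahout API and it is not
the One-Class SVM from the paper cited above: it swaps in a much simpler
technique (a mean / standard deviation threshold over a series of values).
The class name OneClassSketch and the tolerance parameter k are made up for
illustration.

```java
/**
 * Hypothetical one-class detector: it only ever sees "normal" samples
 * during training and flags values that fall far outside their range.
 * This is an illustrative stand-in, not a One-Class SVM.
 */
public class OneClassSketch {
    private final double mean;
    private final double stdDev;
    private final double k; // assumed tolerance, in standard deviations

    public OneClassSketch(double[] normalSamples, double k) {
        if (normalSamples.length == 0) {
            throw new IllegalArgumentException("need at least one normal sample");
        }
        // Estimate the mean of the positive (normal) data.
        double sum = 0.0;
        for (double x : normalSamples) {
            sum += x;
        }
        double m = sum / normalSamples.length;

        // Estimate the (population) standard deviation around that mean.
        double squared = 0.0;
        for (double x : normalSamples) {
            squared += (x - m) * (x - m);
        }
        this.mean = m;
        this.stdDev = Math.sqrt(squared / normalSamples.length);
        this.k = k;
    }

    /** A value counts as normal if it lies within k standard deviations of the mean. */
    public boolean isNormal(double value) {
        return Math.abs(value - mean) <= k * stdDev;
    }

    public static void main(String[] args) {
        double[] normal = {10.0, 10.5, 9.5, 10.2, 9.8};
        OneClassSketch detector = new OneClassSketch(normal, 3.0);
        System.out.println(detector.isNormal(10.1)); // close to the training data
        System.out.println(detector.isNormal(25.0)); // far away from it
    }
}
```

No negative examples are needed at training time; the price is that the
tolerance k has to be chosen by hand or validated separately.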



Re: classification algorithm

2010-11-22 Thread Isabel Drost
On Thu, 18 Nov 2010 Radu Spineanu r...@timisoara.roedu.net wrote:
 I'm a Debian Developer and I noticed Mahout is not in Debian. If I'm 
 able to wrap my head around everything and get it working I would
 love to contribute back and package it.

That would be awesome. Mahout does have quite a few dependencies which
might make it an interesting packaging exercise. I am not sure whether
all of them are available in Debian already. At least Hadoop should be
available in Debian testing, but did not yet make it to the latest
stable release.

Isabel


Re: Mahout in talk

2010-11-12 Thread Isabel Drost
On 12.11.2010 JAGANADH G wrote:
 I will be giving a talk on Machine Learning at BarCamp Kerala 9. I have
 included Mahout in the talk too.
 I will give demo of recommendation and Classification with Mahout.

Would be great if you could put your slides up online in our wiki (if you'll
use any slides): https://cwiki.apache.org/MAHOUT/books-tutorials-and-talks.html

Isabel


Re: How to Cluster?

2010-10-26 Thread Isabel Drost
On Fri, 22 Oct 2010 SIAVASH GHODSI MOGHADDAM gmsiava...@live.utm.my
wrote:
 What I am looking for now, is a Clustering Code Sample.

Did you have a look at the examples module of Mahout? There is also
quite some documentation in the Mahout wiki to get you started.

Isabel



Re: Mahout dependencies on windows

2010-10-26 Thread Isabel Drost
On Mon, 25 Oct 2010 22:54:42 +0100
Steven Bourke sbou...@gmail.com wrote:

 Ted - Has mahout got an image up on EC2 that anyone can use or do we
 have to build from scratch?

None that I'm aware of, however building from scratch should be fairly
easy:

https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce

Isabel


FYI: Fw: Fast Feather Track at ApacheCon - submit your talk now!

2010-10-14 Thread Isabel Drost

I thought the following might be a nice option to present your
awesome Mahout use case to a broader audience - or maybe tell
others what you did for GSoC.


Begin forwarded message:

Date: Mon, 4 Oct 2010 22:54:09 +0100 (BST)
From: Nick Burch nick.bu...@alfresco.com
To: gene...@incubator.apache.org
Subject: Fast Feather Track at ApacheCon - submit your talk now!


Hi All

We've under a month to go now to Atlanta, and hopefully you've all
registered and are all looking forward to a great week?

Other than our packed schedule of talks, our growing list of meetups
(see Shane's email from Friday for more details on hosting your own),
what could be more exciting than your Next Big Thing? Well, that's
where the Fast Feather Track comes in!

The Fast Feather Track provides space for the projects that are just
too new or fast-moving to fit in to the normal CFP. It's especially a
great slot for new incubator projects to talk about what they're up to,
or share their passion for some new technology out there. So, this is
your chance of twenty minutes of fame for your incubating project :)

The Fast Feather Track is all about the technology - so whether you're
a novice or a natural at public speaking, there's room for you!
Anything new at The Apache Software Foundation belongs here, along with
new external technologies that can help us work better. Whether you're
ready for showtime and want the world to know, or you're still finding
your feet, and just fishing for a few new contributors / mentors, this
is the slot for you!

What we're after now is people to tell us what they want to talk about. 
We've got a room for a day, and a big empty schedule board with 20
minute slots on it, so now all we need is some talks to fill it with...
We're aiming to fix most of the schedule now, but we'll probably keep a
few slots spare for some last minute talks. But if you've already got
your ticket, and you know what you want to talk about, please let us
know now, so we can make sure there's space for everyone.

To submit your talk, please head over to google docs and tell us about
yourself and your talk:
https://spreadsheets.google.com/viewform?formkey=dElobGxibG1oc05OeFNqRFZ1S0tpLVE6MQ

We'll see you, your projects, and your great short talks in Atlanta!

Nick

(NB Speaking in the Fast Feather Track does not entitle you to the full
range of speaker perks - you'll get a shiny badge, and your bio in the
program, but you won't get your travel, room or registration comped.
Doesn't make it any less fun though, we promise!)




Re: Mahout usage

2010-10-01 Thread Isabel Drost
On Thu, 30 Sep 2010 Grant Ingersoll gsing...@apache.org wrote:
 Now, if we could just get people to add to the Powered By page!

Anyone ever successfully convinced a Mahout (or Lucene etc.) user to put
their name on the Powered By? I'd be interested in learning more on the
arguments that worked for others...


Isabel


Re: Mahout usage

2010-10-01 Thread Isabel Drost
On Fri, 1 Oct 2010 Grant Ingersoll gsing...@apache.org wrote:
 I'm working on a few...  I know they are out there, as they email in
 private. 

Same here: One huge fear that people seem to have is to reveal the
inner workings of their system not only to the public but also to
potential competitors by putting their name on our list.

Isabel
 


Re: Text Classification using Mahout

2010-09-30 Thread Isabel Drost
On Thu, 30 Sep 2010 Sean Owen sro...@gmail.com wrote:

 Ignore it, it's just Maven doing its thing in the background. It
 should work fine without internet connectivity.

To speed up the build process when you do not have internet
connectivity you can pass -o on the command line to tell Maven to
work offline. That way it does not go and try to check for
updates.

Isabel


SGD example

2010-09-19 Thread Isabel Drost

Hi,

I just tried running the SGD example with the following command line (adapted 
from the corresponding JIRA issue):

./bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic \
  --passes 100 --rate 50 --lambda 0.001 \
  --input examples/src/main/resources/donut.csv \
  --features 21 --output donut.model \
  --target color --categories 2 \
  --predictors x y xx xy yy a b c \
  --types n n

When running the code above I ran into a few NullPointerExceptions - I was able 
to fix them with a few tiny changes. If not stripped they should be attached to 
this mail to highlight the lines of code that caused the trouble. However I was 
wondering whether I simply used the wrong command line.

Isabel
diff --git a/core/src/main/java/org/apache/mahout/classifier/sgd/CsvRecordFactory.java b/core/src/main/java/org/apache/mahout/classifier/sgd/CsvRecordFactory.java
index 5cbdef2..bde3021 100644
--- a/core/src/main/java/org/apache/mahout/classifier/sgd/CsvRecordFactory.java
+++ b/core/src/main/java/org/apache/mahout/classifier/sgd/CsvRecordFactory.java
@@ -243,8 +243,9 @@ public class CsvRecordFactory implements RecordFactory {
   if (predictor >= 0) {
 value = values.get(predictor);
   } else {
-value = null;
+value = null;
   }
+System.out.println(value);
   predictorEncoders.get(predictor).addToVector(value, featureVector);
 }
 return targetValue;
diff --git a/core/src/main/java/org/apache/mahout/vectors/ConstantValueEncoder.java b/core/src/main/java/org/apache/mahout/vectors/ConstantValueEncoder.java
index d76fd81..3112681 100644
--- a/core/src/main/java/org/apache/mahout/vectors/ConstantValueEncoder.java
+++ b/core/src/main/java/org/apache/mahout/vectors/ConstantValueEncoder.java
@@ -34,7 +34,7 @@ public class ConstantValueEncoder extends FeatureVectorEncoder {
 for (int i = 0; i < probes; i++) {
   int n = hashForProbe(originalForm, data.size(), name, i);
   if(isTraceEnabled()){
-trace((byte[]) null, n);
+trace(new byte[]{}, n);
   }
   data.set(n, data.get(n) + getWeight(originalForm,weight));
 }
diff --git a/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java b/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java
index 30cd353..3f7d1d5 100644
--- a/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java
+++ b/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java
@@ -132,6 +132,8 @@ public final class TrainLogistic {
 
   private static double predictorWeight(OnlineLogisticRegression lr, int row, RecordFactory csv, String predictor) {
 double weight = 0;
+if (csv.getTraceDictionary().get(predictor) == null)
+  return 0;
 for (Integer column : csv.getTraceDictionary().get(predictor)) {
   weight += lr.getBeta().get(row, column);
 }


Apache Hadoop Get Together Berlin October 2010 - this time with a huge Mahout focus

2010-09-15 Thread Isabel Drost

Hello,

this is to announce the next Apache Hadoop Get Together sponsored by
JTeam (http://www.jteam.nl) that will take place in newthinking store
in Berlin.

When: October 7th, 5 p.m.
Where: Newthinking store Berlin

As always there will be slots of 30min each for talks on your Hadoop
topic. After each talk there will be plenty of time for discussion. You can
order drinks directly at the bar in the newthinking store. If you like,
you can order pizza. We will go to Cafe Aufsturz after the event for
some beer and something to eat.

Talks scheduled so far:

Max Heimel: Hidden Markov Models for Apache Mahout

Abstract: In this talk I will present and discuss an implementation of
a powerful statistical tool called Hidden Markov Models for the Apache
Mahout project. Hidden Markov Models make it possible to mathematically deduce the
structure of an underlying - and unobservable - process based on the
structure of the produced data. Hidden Markov Models are thus
frequently applied in pattern recognition to deduce structures that are
not directly observable. Examples for applications of Hidden Markov
Models include the recognition of syllables in speech recordings,
handwritten letter recognition and part-of-speech tagging.

Sebastian Schelter: Distributed Itembased Collaborative Filtering with
Apache Mahout

Abstract: Recommendation Mining helps users find items they like. A
very popular way to implement this is by using Collaborative Filtering.
This talk will give an introduction to an approach called Itembased
Collaborative Filtering and explain Mahout's Map/Reduce based
implementation of it.

Please do indicate on Upcoming
http://upcoming.yahoo.com/event/6792156 or on Xing
https://www.xing.com/events/apache-hadoop-berlin-october-2010-564265
if you are coming so we can more safely plan capacities. Updates to the
event, a brief summary and videos will be posted on
http://isabel-drost.de/hadoop

JTeam is looking for Java developers and search enthusiasts. Check out
their jobs page (http://www.jteam.nl/Jobs/Jobs.html) for more info!

As always a big Thank You goes to newthinking store for providing the
venue for free for our event.

Looking forward to seeing you in Berlin as well,
Isabel


Re: how is the Vector format?

2010-09-10 Thread Isabel Drost
On Sun, 5 Sep 2010 Valerio Ceraudo valerio.cera...@gmail.com wrote:
 ok ok I can run your arffToVector in
 org.apache.utils.vectors.arff.Driver but i found a bug, it doesn't
 recognize the attribute REAL, so I changed the arff attributes in
 NUMERIC and it works,now I have got a iris.arff.MVC file.

Any chance you might have some time to file a JIRA issue for that - or
maybe even provide a patch that fixes the issue?
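
For anyone picking this up: the ARFF format treats numeric, real and integer
all as numeric attribute types, and its keywords are case-insensitive, so a
fix would presumably accept REAL wherever NUMERIC is accepted. The sketch
below is standalone Java with a hypothetical class ArffTypeSketch; it is not
the code in Mahout's ARFF utilities and only illustrates the check itself.

```java
import java.util.Locale;

/**
 * Hypothetical helper showing how an ARFF "@attribute" line could be
 * classified as numeric. Not taken from Mahout's ARFF reader.
 */
public class ArffTypeSketch {

    /** Returns true if the attribute declaration ends in a numeric ARFF type. */
    public static boolean isNumericAttribute(String attributeLine) {
        String trimmed = attributeLine.trim();
        // The datatype is the last whitespace-separated token of the line.
        int lastSpace = Math.max(trimmed.lastIndexOf(' '), trimmed.lastIndexOf('\t'));
        String type = trimmed.substring(lastSpace + 1).toLowerCase(Locale.ROOT);
        // ARFF keywords are case-insensitive; real and integer count as numeric.
        return type.equals("numeric") || type.equals("real") || type.equals("integer");
    }

    public static void main(String[] args) {
        System.out.println(isNumericAttribute("@attribute sepallength REAL"));
        System.out.println(isNumericAttribute("@attribute sepalwidth numeric"));
        System.out.println(isNumericAttribute("@attribute class {Iris-setosa,Iris-versicolor}"));
    }
}
```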

Isabel



Re: Version compatibility of Mahout 0.4-SNAPSHOT with Hadoop release?

2010-07-19 Thread Isabel Drost
On Thu Peter M. Goldstein peter_m_goldst...@yahoo.com wrote:
 Yes, my original email should have said 0.20.2+320.  Sorry about the
 typo.  You can find that version here:
 
 http://archive.cloudera.com/cdh/3/

Or at Debian Squeeze (http://packages.qa.debian.org/h/hadoop.html) or
of course directly from the Apache Hadoop project.


 And it does explicitly say 0.20.2 on the Mahout on Amazon EC2 wiki
 page.

Just for further reference - system requirements for Mahout are tracked
on the wiki page named accordingly:

https://cwiki.apache.org/confluence/display/MAHOUT/System+Requirements


Isabel


Re: ICML / COLT and Mahout

2010-07-12 Thread Isabel Drost
On Wed Danny Leshem dles...@gmail.com wrote:
 I took a different track, so only had a chance to chat with some of
 the open-source participants during their poster session. Most of
 them never heard of Mahout, or only heard of it by name.

Would you be interested in introducing Mahout to the ICML/COLT people
in a future workshop or in JMLR MLOSS?

I am sure the Mahout community would be more than happy to help you
proof-read your publication.

Isabel


Re: Installing Mahout

2010-06-17 Thread Isabel Drost
On Thu tammuz rasil...@gmail.com wrote:
 Well this is what I note during the installation:
 
 Running org.apache.mahout.clustering.TestPrintableInterface
 Tests run: 22, Failures: 19, Errors: 0, Skipped: 0, Time elapsed:
 0.427 sec <<< FAILURE!

In case of failing tests you should be able to see more information
when looking into

$module-name/target/surefire-reports/org.apache.mahout.clustering.TestPrintableInterface.txt
 

The content of that file should help diagnose the problem for us as
well.

Isabel


Re: Getting started with mahout

2010-06-11 Thread Isabel Drost
On Tue Jeff Eastman j...@windwardsolutions.com wrote:
 that you can browse for historical purposes. As a way of getting
 started, I'd suggest learning to run some of the examples. If one of
 our algorithms seems most interesting, jump into its unit tests and
 begin to explore the code.

Some more information on how to get started contributing to Mahout:

https://cwiki.apache.org/MAHOUT/howtocontribute.html

Isabel


Re: Gephi graph visualization

2010-06-11 Thread Isabel Drost
On Thu Grant Ingersoll gsing...@apache.org wrote:
 Stefan G. gave a nice demo of this at Buzzwords (http://gephi.org/)
 and I tried it out on the plane ride home and it seems like it could
 be used as a nice way to visualize clusters.  It can import a CSV
 file that is essentially a big matrix of nodes and edges.  I think it
 wouldn't be too hard to have a job that converts the clusters into
 this CSV format for easy loading.

+1

I used Gephi for graph visualisation earlier this year - it seems
capable of handling reasonably sized graphs and makes understanding
their structure really easy. Being an interactive tool it's also
helpful in exploring your linked data.

Would be great to be able to, say, import the results of our clustering
jobs into Gephi.
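
A conversion step along those lines could be quite small. The sketch below is
plain Java and not an existing Mahout job: the class name ClusterToGephiCsv,
the in-memory map from point IDs to cluster IDs, and the choice of a
Source,Target edge list (rather than the matrix layout mentioned above) are
all assumptions made for illustration. Each point is linked to a node that
stands for its cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical converter from cluster assignments to a CSV edge list
 * with "Source,Target" columns. Not part of Mahout.
 */
public class ClusterToGephiCsv {

    /**
     * Emits one edge per point, connecting the point to a node named
     * after its cluster. IDs are written as-is, so IDs containing commas
     * would need quoting, which this sketch does not handle.
     */
    public static String toEdgeListCsv(Map<String, Integer> pointToCluster) {
        StringBuilder csv = new StringBuilder("Source,Target\n");
        for (Map.Entry<String, Integer> entry : pointToCluster.entrySet()) {
            csv.append(entry.getKey())
               .append(",cluster-")
               .append(entry.getValue())
               .append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order so the output is stable.
        Map<String, Integer> assignments = new LinkedHashMap<>();
        assignments.put("doc1", 0);
        assignments.put("doc2", 0);
        assignments.put("doc3", 1);
        System.out.print(toEdgeListCsv(assignments));
    }
}
```

In a real job the assignments would come from the clustering output on HDFS
rather than from a map in memory.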

Isabel