Re: Hi ... need some help?
Chris, This is really nice work. On Wed, Apr 22, 2020 at 1:46 AM Christofer Dutz wrote: > Hi Andrew, > > thanks for your kind words ... they are sort of the fuel that makes me run > ;-) > > So some general observations and suggestions: > - You seem to use test-jars quite a bit: these are generally considered an > antipattern, as you possibly import problems from another module and you > will have no way of detecting them. If you need shared test code, it's > better practice to create a dedicated test-utils module and include that > wherever it's needed. > - Don't use variables for project dependencies: it makes things slightly > more difficult to read, the release plugin takes care of updating versions > for you, and some third-party plugins might have issues with it. > - I usually provide versions for all project dependencies and have all > other dependencies managed in a dependencyManagement section of the root > module; this avoids problems with version conflicts when constructing > something using multiple parts of your project (especially your lib > directory thing). > - Accessing resources outside of the current module's scope is generally > considered an antipattern ... regarding your lib thing, I would suggest an > assembly that builds a directory (but I do understand that this version > perhaps speeds up the development workflow ... we could move the clean > plugin configuration and the antrun plugin config into a profile dedicated > to development). > - I usually order the plugin configurations (as much as possible) the way > they are usually executed in the build ... so: clean, process resources, > compile, test, package, ... This makes it easier to understand the build in > general. > > Today I'll go through the poms again, managing all versions and cleaning up > the order of things. Then, if all still works, I would bump the dependency > versions up as much as possible. > > Will report back as soon as I'm through or I've got something to report ... 
> then I'll also go into details with your feedback (I haven't ignored it ;-) > ) > > Chris > > > > On 22.04.20, 06:08, "Andrew Palumbo" wrote: > > Fixing previous message.. > > > Quote from Chris Dutz: > > > Hi folks, > >so I was now able to build (including all tests) with Java 8 and > 9 ... currently trying 10 ... > >Are there any objections if some maven dependencies get updated > to more recent versions? I mean ... the hbase-client you're using is more > than 5 years old ... > > My answer: > > I personally have no problem with updating any dependencies; > they may break some things and cause more work, but that is the kind of > thing that we've been trying to get done in this build work, get > everything up to speed. > > I'd say take Andrew, Trevor and Pat's word over mine though, as I am a bit > less active presently. > > Thanks. > > Andy > > > From: Andrew Palumbo > Sent: Tuesday, April 21, 2020 10:17 PM > To: dev@mahout.apache.org > Subject: Re: Hi ... need some help? > > Hi folks, > > so I was now able to build (including all tests) with Java 8 and 9 > ... currently trying 10 ... > > Are there any objections if some maven dependencies get updated > to more recent versions? I mean ... the hbase-client you're using is more > than 5 years old ... > Not by me, I believe that is being used by the MR module, which is > deprecated. > > I personally have no problem with updating any dependencies; > they may break some things and cause more work, but that is the kind of > thing that we've been trying to get done in this build work, get > everything up to speed. > > I'd say take Andrew, Trevor and Pat's word over mine though, as I am a bit > less active presently. > > Thanks. > > Andy > > From: Andrew Palumbo > Sent: Tuesday, April 21, 2020 10:13 PM > To: dev@mahout.apache.org > Subject: Re: Hi ... need some help? > > Chris, Thank you so much for what you are doing, This is Apache at > its best.. 
I've been down and out with a serious illness, injury, and other > issues, which have seriously limited my machine time. I was pretty close > to getting a good build, but it was hacky, and the method that you use to > name the modules for both Scala versions looks great. > > We've always relied on Stevo to fix the builds for us, but as he said, he > is unable to contribute right now. The main issues (solved by hacks) > currently are: > > > 1. Dependencies and transitive dependencies are not being picked up > and copied to the `./lib` directory, where `/bin/mahout` and parts of the > MahoutSparkContext look for them, to add to the class path. So running > either from the CLI or as a library, dependencies are not picked up. > * We used to use the mahout-experimental-xx.jar as a fat jar > for this, though it was bloated with now
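The dependencyManagement advice in the message above can be sketched as a POM fragment. This is a hedged illustration, not Mahout's actual POM; the hbase-client coordinates are real, but the version shown is only an example:

```xml
<!-- Root pom.xml: third-party versions are pinned once in
     dependencyManagement, so child modules never repeat them. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>2.2.4</version><!-- example version, for illustration only -->
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- A child module then declares the dependency without a version;
     Maven resolves it from the root's dependencyManagement section. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
  </dependency>
</dependencies>
```

This keeps version conflicts out of downstream consumers that mix several modules of the project, which is the problem described above with the `./lib` directory assembly.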
Re: [jira] [Created] (MAHOUT-2094) Advanced Excel Training In Pune
I already took a look. I couldn't even delete the post. I did file an infra JIRA to block the poster. Anybody can file a similar JIRA to limit issue creation to committers. I would wait for a second occurrence, however. On Sat, Feb 29, 2020 at 12:29 PM Giorgio Zoppi wrote: > Ok, > ted, could you help on this? > BR, > Giorgio >
Re: [jira] [Created] (MAHOUT-2092) Machine Learning is an extensive area of Artificial Intelligence focused on the classical design
I removed the spammy content and asked infra to blacklist the poster. On Mon, Feb 24, 2020 at 2:59 AM Giorgio Zoppi wrote: > This should not be permitted. We don't care about ML courses; if we want a > course we look for ourselves. > BR, > Giorgio > > On Mon, Feb 24, 2020 at 11:45, Diksha Kakade (Jira) ( >) > wrote: > > > Diksha Kakade created MAHOUT-2092: > > - > > > > Summary: Machine Learning is an extensive area of Artificial > > Intelligence focused on the classical design > > Key: MAHOUT-2092 > > URL: https://issues.apache.org/jira/browse/MAHOUT-2092 > > Project: Mahout > > Issue Type: Blog - New Blog Request > > Components: Classification > > Affects Versions: 0.12.2 > > Reporter: Diksha Kakade > > Fix For: 14.2 > > Attachments: Machine_Learning_Header-compressor.jpg > > > > Master in Machine learning, Artificial Intelligence and Big Data workshop > > as part of their AI and Deep Learning training at SevenMentor training in > > Pune. Our Machine Learning course in Pune at SevenMentor, syllabus > > comprises the latest algorithms such as ANN, MLP RNN Autoencoders and > > moreover this app is considered to be the best Machine learning class in > > this region. There are a whole lot of amazing Artificial intelligence > > projects offered and nearly many of our candidates went to integrate with > > the fortune 100 firms. Students studying artificial intelligence training > > and Machine learning education, big data training are rigorously trained > > using live sector applicable case studies. What are you waiting for? > > Register now for the absolute best Machine Learning course in Pune at > > SevenMentor training pioneering your career into the AI companies and > learn > > the updated concepts. 
Re: MathJax not rendering on Website
This has happened periodically to my sites. The answer is usually that the canonical location of the MathJax JavaScript library has changed. On Sep 10, 2017 7:58 PM, "Andrew Palumbo" wrote: > It looks like MathJax is not rendering TeX on the site: > > > E.g.: > > > https://mahout.apache.org/users/algorithms/d-ssvd.html > > Ideas to get this going while the site is being redone? > > >
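When the canonical library location moves, the usual fix is to repoint the loader script. As a hedged illustration (this is the current MathJax 3 jsDelivr distribution, not necessarily what the 2017 site used), the page template would load MathJax like this:

```html
<!-- Configure inline-math delimiters before loading MathJax 3 -->
<script>
  window.MathJax = { tex: { inlineMath: [['$', '$'], ['\\(', '\\)']] } };
</script>
<!-- Load the combined TeX/MathML -> CHTML component from the CDN -->
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"
        async></script>
```

If the site pins an old CDN URL, swapping in a maintained location like the above is usually all that is needed.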
Re: Looking for help with a talk
Any time. Ping me directly. On Fri, Aug 4, 2017 at 1:12 AM, Isabel Drost-Fromm wrote: > Hi, > > I have a first draft of a narrative and slide deck. If anyone has time, it > would be lovely to bounce some ideas back and forth and have the draft of the > deck reviewed. > > > Isabel > >
Re: Unsubscribe.
Glad it worked. Sad to see you go. On Thu, Jun 8, 2017 at 4:24 AM, Roshan Kedar <rosbl...@gmail.com> wrote: > Hi Ted, > > Sorry to bother you but problem was @my end. I had sent mail to > unsubscribe but the "confirmation to unsubscribe" mail was sent to trash. > > Finally I did unsubscribe. > Thanks for support. > > Regards > Roshan Kedar > > On 8 Jun 2017 03:10, "Ted Dunning" <ted.dunn...@gmail.com> wrote: > >> >> Is there a chance you subscribed under another email address. >> >> >> >> On Wed, Jun 7, 2017 at 12:40 AM, Roshan Kedar <rosbl...@gmail.com> wrote: >> >>> Hahaha, >>> >>> Two, including today's mail after your reply. >>> >>> Actually your mails are overwhelming in number. But it was nice working >>> on >>> mahout. >>> >>> But now working on totally different field for some time. So please >>> unsubscribe. >>> >>> On 7 Jun 2017 03:07, "Trevor Grant" <trevor.d.gr...@gmail.com> wrote: >>> >>> > How many times have you sent an email to >>> dev-unsubscr...@mahout.apache.org >>> > ? >>> > >>> > On Tue, Jun 6, 2017 at 4:00 PM, Roshan Kedar <rosbl...@gmail.com> >>> wrote: >>> > >>> > > And exactly how many times I have to unsubscribe from this >>> newsletter? >>> > > >>> > > Unsubscribe me please. >>> > > >>> > >>> >> >>
Re: Unsubscribe.
Is there a chance you subscribed under another email address? On Wed, Jun 7, 2017 at 12:40 AM, Roshan Kedar wrote: > Hahaha, > > Two, including today's mail after your reply. > > Actually your mails are overwhelming in number. But it was nice working on > mahout. > > But now I am working in a totally different field for some time. So please > unsubscribe. > > On 7 Jun 2017 03:07, "Trevor Grant" wrote: > > > How many times have you sent an email to dev-unsubscribe@mahout.apache.org > > ? > > > > On Tue, Jun 6, 2017 at 4:00 PM, Roshan Kedar wrote: > > > > > And exactly how many times do I have to unsubscribe from this newsletter? > > > > > > Unsubscribe me please. > > > > > >
Re: New logo
On Sat, May 6, 2017 at 2:43 PM, Scott C. Cote wrote: > Will you be wearing “one of those t-shirts” on Monday in Houston :) ? > Not likely. It is in the archive.
Re: New logo
; >> > problems, and really statistics / "machine-learning" in general, in > >that > >> we > >> > can't find perfect solutions, yet we believe solutions exist and > >serve as > >> > our blueprint. > >> > > >> > Finally, I like that it is simple. > >> > > >> > Things I don't like about it: > >> > Lucent Technologies used it back in the 90s, however they used a > >very > >> > specific red one, and this isn't a deal breaker for me. > >> > > >> > Other thoughts: > >> > Based on the tattoo I saw- one could make an Enso using old mahout > >color > >> > palatte if one were to dab their brush in the appropriate colors. > >This > >> > could also be represented in any single color. (Not sure what that > >does > >> to > >> > our TM, is it ok if we just keep slapping TMs on the side of it? If > >that > >> is > >> > the case is there any reason we must have a single Enso?) > >> > > >> > So there is something to throw in the pot that is a little more > >grown up > >> > than my runner up favorites (honey badger, blueman riding bomb > >waving > >> > cowboy hat, blueman riding lighting bolt into a squirrel covered in > >> water, > >> > etc). > >> > > >> > Again, only know what wiki has told me, so if anyone is more > >familiar > >> with > >> > this symbol (like was it used as a logo by some horrible dictator > >which > >> > carried out terrible attrocities?) or just general comments. > >> > tg > >> > > >> > > >> > > >> > Trevor Grant > >> > Data Scientist > >> > https://github.com/rawkintrevo > >> > http://stackexchange.com/users/3002022/rawkintrevo > >> > http://trevorgrant.org > >> > > >> > *"Fortunate is he, who is able to know the causes of things." > >-Virgil* > >> > > >> > > >> > On Thu, Apr 27, 2017 at 5:50 PM, Ted Dunning > ><ted.dunn...@gmail.com> > >> wrote: > >> > > >> >> I don't have any constructive input at all. None of the proposals > >showed > >> >> any spark (to me). > >> >> > >> >> I hate it when I can't suggest a better path and I hate negative > >> feedback. 
> >> >> But there it is. > >> >> > >> >> > >> >> > >> >> On Thu, Apr 27, 2017 at 3:48 PM, Pat Ferrel > ><p...@occamsmachete.com> > >> wrote: > >> >> > >> >>> Do you have constructive input (guidance or opinion is welcome > >input) > >> or > >> >>> would you like to discontinue the contest. If the later, -1 now. > >> >>> > >> >>> > >> >>> On Apr 27, 2017, at 3:42 PM, Ted Dunning <ted.dunn...@gmail.com> > >> wrote: > >> >>> > >> >>> I thought that none of the proposals were worth continuing with. > >> >>> > >> >>> > >> >>> > >> >>> On Thu, Apr 27, 2017 at 3:36 PM, Pat Ferrel > ><p...@occamsmachete.com> > >> >> wrote: > >> >>> > >> >>>> Yes, -1 means you hate them all or think the designers are not > >worth > >> >>>> paying. We have to pay to continue, I’ll foot the bill > >(donations > >> >>>> appreciated) but don’t want to unless people think it will lead > >to > >> >>>> something. For me there are a couple I wouldn’t mind seeing on > >the web > >> >>> site > >> >>>> or swag and yes we do have time to try something completely > >different, > >> >>> and > >> >>>> the designers will be more willing since there is a guaranteed > >payout. > >> >>>> > >> >>>> > >> >>>> On Apr 27, 2017, at 3:30 PM, Andrew Musselman < > >> >>> andrew.mussel...@gmail.com> > >> >>>> wrote: > >> >>>> > >> >>>> I thought we were just voting on continuing this process :) > >> >>>> > >> >>>> On Thu, Apr 27, 2017 at 3:22 PM, Trevor Grant < > >> >> trevor.d.gr...@gmail.com> > >> >>>> w
Re: New logo
I haven't been active enough to feel good about an out and out -1. Put me as -0 On Thu, Apr 27, 2017 at 3:54 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > Fair enough, I think Trevor feels the same. > > The blue man can continue, all it takes is a -1 > > > On Apr 27, 2017, at 3:50 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > > I don't have any constructive input at all. None of the proposals showed > any spark (to me). > > I hate it when I can't suggest a better path and I hate negative feedback. > But there it is. > > > > On Thu, Apr 27, 2017 at 3:48 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > > > Do you have constructive input (guidance or opinion is welcome input) or > > would you like to discontinue the contest. If the later, -1 now. > > > > > > On Apr 27, 2017, at 3:42 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > > > > I thought that none of the proposals were worth continuing with. > > > > > > > > On Thu, Apr 27, 2017 at 3:36 PM, Pat Ferrel <p...@occamsmachete.com> > wrote: > > > >> Yes, -1 means you hate them all or think the designers are not worth > >> paying. We have to pay to continue, I’ll foot the bill (donations > >> appreciated) but don’t want to unless people think it will lead to > >> something. For me there are a couple I wouldn’t mind seeing on the web > > site > >> or swag and yes we do have time to try something completely different, > > and > >> the designers will be more willing since there is a guaranteed payout. > >> > >> > >> On Apr 27, 2017, at 3:30 PM, Andrew Musselman < > > andrew.mussel...@gmail.com> > >> wrote: > >> > >> I thought we were just voting on continuing this process :) > >> > >> On Thu, Apr 27, 2017 at 3:22 PM, Trevor Grant <trevor.d.gr...@gmail.com > > > >> wrote: > >> > >>> Also Pat, thank you for organizing. 
> >>> > >>> +0 > >>> > >>> I don't love any of them enough to +1, I don't hate them all enough to > > -1 > >>> > >>> Most of them remind me of some spin on Apache Apex, Python, Numpy (a > >> Python > >>> Library), or IBM's DSX. However, I realize a big part of that is the > >>> colors chosen. > >>> > >>> #143 is my favorite (possibly because it reminds me of none of the > >> above). > >>> But possibly if this goes to next round we can have them adjust hues / > >>> colors. > >>> > >>> Trevor Grant > >>> Data Scientist > >>> https://github.com/rawkintrevo > >>> http://stackexchange.com/users/3002022/rawkintrevo > >>> http://trevorgrant.org > >>> > >>> *"Fortunate is he, who is able to know the causes of things." -Virgil* > >>> > >>> > >>> On Thu, Apr 27, 2017 at 5:15 PM, Andrew Musselman < > >>> andrew.mussel...@gmail.com> wrote: > >>> > >>>> +1 to continue; thanks for organizing this Pat! > >>>> > >>>> My personal favorite is #38 > >>>> https://images-platform.99static.com/I9quDzcBrtJXg_ > >>> NMaIsH6ySQ7Ok=/filters: > >>>> quality(100)/99designs-contests-attachments/84/84017/ > >> attachment_84017937 > >>>> > >>>> I like the stylized and simple "M" and it reminds me of diagrams > > showing > >>>> vector multiplication. > >>>> > >>>> On Thu, Apr 27, 2017 at 12:56 PM, Pat Ferrel <p...@occamsmachete.com> > >>>> wrote: > >>>> > >>>>> We can treat this like a release vote, if anyone hates all these and > >>>>> doesn’t want to continue with shortlisted designers for 3 more days > >>> (the > >>>>> next step) vote -1 and say if your vote is binding (your are a PMC > >>>> member) > >>>>> > >>>>> Otherwise all are welcome to rate everything on the polls below. > >>>>> > >>>>> In this case you have 24 hours to vote > >>>>> > >>>>> Here’s my +1 to continue refining. > >>>>> > >>>>> > >>>>> On Apr 27, 2017, at 11:41 AM, Pat Ferrel <p...@occamsmachete.com> > >>> wrote: > >>>>> > >>>>> Here is a second group, hopefully picked to be unique. 
> >>>>> https://99designs.com/contests/poll/vl7xed > >>>>> > >>>>> We got a lot of responses, these 2 polls contain the best afaict. > >>>>> > >>>>> > >>>>> On Apr 27, 2017, at 11:25 AM, Pat Ferrel <p...@occamsmachete.com> > >>> wrote: > >>>>> > >>>>> Vote: https://99designs.com/contests/poll/rqcgif > >>>>> > >>>>> We asked for something “mathy” and asked for no elephant and rider. > We > >>>>> have the rest of the week to tweak so leave comments about what you > >>> like > >>>> or > >>>>> would like to change. > >>>>> > >>>>> We don’t have to pick one of these, so if you hate them all, make > that > >>>>> known too. > >>>>> > >>>>> > >>>>> > >>>> > >>> > >> > >> > > > > > >
Re: New logo
I don't have any constructive input at all. None of the proposals showed any spark (to me). I hate it when I can't suggest a better path and I hate negative feedback. But there it is. On Thu, Apr 27, 2017 at 3:48 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > Do you have constructive input (guidance or opinion is welcome input) or > would you like to discontinue the contest. If the later, -1 now. > > > On Apr 27, 2017, at 3:42 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > > I thought that none of the proposals were worth continuing with. > > > > On Thu, Apr 27, 2017 at 3:36 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > > > Yes, -1 means you hate them all or think the designers are not worth > > paying. We have to pay to continue, I’ll foot the bill (donations > > appreciated) but don’t want to unless people think it will lead to > > something. For me there are a couple I wouldn’t mind seeing on the web > site > > or swag and yes we do have time to try something completely different, > and > > the designers will be more willing since there is a guaranteed payout. > > > > > > On Apr 27, 2017, at 3:30 PM, Andrew Musselman < > andrew.mussel...@gmail.com> > > wrote: > > > > I thought we were just voting on continuing this process :) > > > > On Thu, Apr 27, 2017 at 3:22 PM, Trevor Grant <trevor.d.gr...@gmail.com> > > wrote: > > > >> Also Pat, thank you for organizing. > >> > >> +0 > >> > >> I don't love any of them enough to +1, I don't hate them all enough to > -1 > >> > >> Most of them remind me of some spin on Apache Apex, Python, Numpy (a > > Python > >> Library), or IBM's DSX. However, I realize a big part of that is the > >> colors chosen. > >> > >> #143 is my favorite (possibly because it reminds me of none of the > > above). > >> But possibly if this goes to next round we can have them adjust hues / > >> colors. 
> >> > >> Trevor Grant > >> Data Scientist > >> https://github.com/rawkintrevo > >> http://stackexchange.com/users/3002022/rawkintrevo > >> http://trevorgrant.org > >> > >> *"Fortunate is he, who is able to know the causes of things." -Virgil* > >> > >> > >> On Thu, Apr 27, 2017 at 5:15 PM, Andrew Musselman < > >> andrew.mussel...@gmail.com> wrote: > >> > >>> +1 to continue; thanks for organizing this Pat! > >>> > >>> My personal favorite is #38 > >>> https://images-platform.99static.com/I9quDzcBrtJXg_ > >> NMaIsH6ySQ7Ok=/filters: > >>> quality(100)/99designs-contests-attachments/84/84017/ > > attachment_84017937 > >>> > >>> I like the stylized and simple "M" and it reminds me of diagrams > showing > >>> vector multiplication. > >>> > >>> On Thu, Apr 27, 2017 at 12:56 PM, Pat Ferrel <p...@occamsmachete.com> > >>> wrote: > >>> > >>>> We can treat this like a release vote, if anyone hates all these and > >>>> doesn’t want to continue with shortlisted designers for 3 more days > >> (the > >>>> next step) vote -1 and say if your vote is binding (your are a PMC > >>> member) > >>>> > >>>> Otherwise all are welcome to rate everything on the polls below. > >>>> > >>>> In this case you have 24 hours to vote > >>>> > >>>> Here’s my +1 to continue refining. > >>>> > >>>> > >>>> On Apr 27, 2017, at 11:41 AM, Pat Ferrel <p...@occamsmachete.com> > >> wrote: > >>>> > >>>> Here is a second group, hopefully picked to be unique. > >>>> https://99designs.com/contests/poll/vl7xed > >>>> > >>>> We got a lot of responses, these 2 polls contain the best afaict. > >>>> > >>>> > >>>> On Apr 27, 2017, at 11:25 AM, Pat Ferrel <p...@occamsmachete.com> > >> wrote: > >>>> > >>>> Vote: https://99designs.com/contests/poll/rqcgif > >>>> > >>>> We asked for something “mathy” and asked for no elephant and rider. We > >>>> have the rest of the week to tweak so leave comments about what you > >> like > >>> or > >>>> would like to change. 
> >>>> > >>>> We don’t have to pick one of these, so if you hate them all, make that > >>>> known too. > >>>> > >>>> > >>>> > >>> > >> > > > > > >
Re: New logo
I thought that none of the proposals were worth continuing with. On Thu, Apr 27, 2017 at 3:36 PM, Pat Ferrelwrote: > Yes, -1 means you hate them all or think the designers are not worth > paying. We have to pay to continue, I’ll foot the bill (donations > appreciated) but don’t want to unless people think it will lead to > something. For me there are a couple I wouldn’t mind seeing on the web site > or swag and yes we do have time to try something completely different, and > the designers will be more willing since there is a guaranteed payout. > > > On Apr 27, 2017, at 3:30 PM, Andrew Musselman > wrote: > > I thought we were just voting on continuing this process :) > > On Thu, Apr 27, 2017 at 3:22 PM, Trevor Grant > wrote: > > > Also Pat, thank you for organizing. > > > > +0 > > > > I don't love any of them enough to +1, I don't hate them all enough to -1 > > > > Most of them remind me of some spin on Apache Apex, Python, Numpy (a > Python > > Library), or IBM's DSX. However, I realize a big part of that is the > > colors chosen. > > > > #143 is my favorite (possibly because it reminds me of none of the > above). > > But possibly if this goes to next round we can have them adjust hues / > > colors. > > > > Trevor Grant > > Data Scientist > > https://github.com/rawkintrevo > > http://stackexchange.com/users/3002022/rawkintrevo > > http://trevorgrant.org > > > > *"Fortunate is he, who is able to know the causes of things." -Virgil* > > > > > > On Thu, Apr 27, 2017 at 5:15 PM, Andrew Musselman < > > andrew.mussel...@gmail.com> wrote: > > > >> +1 to continue; thanks for organizing this Pat! > >> > >> My personal favorite is #38 > >> https://images-platform.99static.com/I9quDzcBrtJXg_ > > NMaIsH6ySQ7Ok=/filters: > >> quality(100)/99designs-contests-attachments/84/84017/ > attachment_84017937 > >> > >> I like the stylized and simple "M" and it reminds me of diagrams showing > >> vector multiplication. 
> >> > >> On Thu, Apr 27, 2017 at 12:56 PM, Pat Ferrel > >> wrote: > >> > >>> We can treat this like a release vote, if anyone hates all these and > >>> doesn’t want to continue with shortlisted designers for 3 more days > > (the > >>> next step) vote -1 and say if your vote is binding (your are a PMC > >> member) > >>> > >>> Otherwise all are welcome to rate everything on the polls below. > >>> > >>> In this case you have 24 hours to vote > >>> > >>> Here’s my +1 to continue refining. > >>> > >>> > >>> On Apr 27, 2017, at 11:41 AM, Pat Ferrel > > wrote: > >>> > >>> Here is a second group, hopefully picked to be unique. > >>> https://99designs.com/contests/poll/vl7xed > >>> > >>> We got a lot of responses, these 2 polls contain the best afaict. > >>> > >>> > >>> On Apr 27, 2017, at 11:25 AM, Pat Ferrel > > wrote: > >>> > >>> Vote: https://99designs.com/contests/poll/rqcgif > >>> > >>> We asked for something “mathy” and asked for no elephant and rider. We > >>> have the rest of the week to tweak so leave comments about what you > > like > >> or > >>> would like to change. > >>> > >>> We don’t have to pick one of these, so if you hate them all, make that > >>> known too. > >>> > >>> > >>> > >> > > > >
Re: Marketing
On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel wrote: > maybe we should drop the name Mahout altogether. I have been told that there is a cool secondary interpretation of Mahout as well. I think that the Hebrew word is pronounced roughly like Mahout. מַהוּת The cool thing is that this word means "essence" or possibly "truth". So regardless of the guy riding the elephant, Mahout still has something to be said for it. (I have no Hebrew, btw) (real speakers may want to comment here)
Re: LLR thresholds
MAP is dangerous, as are all off-line comparisons. The problem is that it tends to over-emphasize precision over recall, and it tends to emphasize replicating what has been seen before. Increasing the threshold increases precision and decreases recall. But MAP mostly only cares about the top hit. In practice, you want lots of good hits in the results page. On Wed, Mar 8, 2017 at 8:18 AM, Pat Ferrel wrote: > The CCO algorithm now supports a couple of ways to limit indicators by > “quality". The new way is by the value of LLR. We built a t-digest > mechanism to look at the overall density produced with different > thresholds. The higher the threshold, the lower the number of indicators > and the lower the density of the resulting indicator matrix, but also the > higher the MAP score (of the full recommender). So MAP seems to increase > monotonically until it breaks down. > > This didn’t match my understanding of LLR, which is actually a test for > non-correlation. I was expecting high scores to mean a high likelihood of > non-correlation. So the actual formulation of the code must be reversing > that, so the higher the score, the higher the likelihood that non-correlation > is **false** (this is treated as evidence of correlation). > > The next observation is that with high thresholds we get higher MAP scores > from the recommender (expected), but this increases monotonically until it > breaks down because there are so few indicators left. This leads us to the > conclusion that MAP is not a good way to set the threshold. We tried > looking at precision (MAP) vs recall (number of people who get recs) and > this gave ambiguous results with the data we had. > > Given my questions about how LLR is actually formulated in Mahout, I’m > unsure how to convert it into something like a confidence score or some > other way to judge the threshold that would lead to a good way to choose a > threshold. 
Any ideas or illumination about how it’s being calculated or how > to judge the threshold? > > > > Long description of motivation: > > LLR thresholds are needed when comparing conversion events to things that > have very small dimensionality, so maxIndicatorsPerItem does not work well. > For example, a location by state where there are 50: maxIndicatorsPerItem > defaults to 50, so you may end up with 50 very weak indicators. If there are > strong indicators in the data, thresholds should be the way to find them. > This might lead to a few per item if the data supports it, and this should > then be useful. The question above is how to choose a threshold. >
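For context on the question of how the score is formulated: the LLR used here is Dunning's G² test over the 2x2 contingency table of occurrence counts. It is near zero under independence and grows as the observed co-occurrence deviates from independence, which is why a high score reads as evidence *against* non-correlation, matching the observation above. A self-contained sketch of that computation (method names are illustrative; Mahout's own implementation lives in its LogLikelihood class):

```java
// Sketch of the 2x2 log-likelihood ratio (G^2) score, in the entropy
// formulation: LLR = 2 * (H(rows) + H(cols) - H(matrix)), where H is the
// "unnormalized entropy" sum*log(sum) - sum(x*log(x)).
public class Llr {
    // Unnormalized entropy of a set of counts; zero counts contribute nothing.
    static double entropy(double... counts) {
        double sum = 0.0, xLogX = 0.0;
        for (double c : counts) {
            sum += c;
            if (c > 0) xLogX += c * Math.log(c);
        }
        return (sum > 0 ? sum * Math.log(sum) : 0.0) - xLogX;
    }

    // k11: both events, k12/k21: one event only, k22: neither event
    static double logLikelihoodRatio(double k11, double k12, double k21, double k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + colEntropy - matEntropy);
    }

    public static void main(String[] args) {
        // Independent counts score ~0; a perfectly associated table scores high.
        System.out.println(Llr.logLikelihoodRatio(10, 10, 10, 10)); // ~0.0
        System.out.println(Llr.logLikelihoodRatio(1, 0, 0, 1));     // 4*ln(2) ~ 2.77
    }
}
```

Since G² is asymptotically chi-squared with one degree of freedom, one hedged way to pick a threshold is to invert a chi-squared tail probability rather than tune against MAP, which speaks to the confidence-score question above.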
[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408387#comment-15408387 ] Ted Dunning commented on MAHOUT-1853: - [~pferrel] Computing the parameters of a normal distribution is definitely cheaper than updating a t-digest, but I doubt that the difference will be visible. It takes a few additions and divisions to update the mean and sd, while it takes 100-200ns on average to update a t-digest with a new sample. But the big win happens when the data being collected is grossly non-normal, or when the stuff of interest is an anomalous tail in an otherwise normal distribution. Both of these cases apply in this situation. > Improvements to CCO (Correlated Cross-Occurrence) > - > > Key: MAHOUT-1853 > URL: https://issues.apache.org/jira/browse/MAHOUT-1853 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.12.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold > calculation for LLR downsampling, and possible multiple fixed thresholds for > A’A, A’B etc. This is to account for the vast difference in dimensionality > between indicator types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
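The "few additions and divisions to update the mean and sd" mentioned in the comment can be made concrete with Welford's online algorithm, sketched below. The class name is illustrative (this is not Mahout code); as the comment notes, a t-digest is the better tool when the distribution is grossly non-normal or the interesting signal is an anomalous tail:

```java
// Welford's online algorithm: numerically stable running mean and
// standard deviation, updated with a few additions and one division.
public class RunningStats {
    private long n;
    private double mean;
    private double m2;   // sum of squared deviations from the running mean

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;            // incremental mean update
        m2 += delta * (x - mean);     // uses both old and new mean
    }

    public double getMean() { return mean; }

    // Sample standard deviation; 0 until at least two samples are seen.
    public double getSd() {
        return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
    }

    public static void main(String[] args) {
        RunningStats s = new RunningStats();
        for (double x : new double[] {1, 2, 3, 4, 5}) s.add(x);
        System.out.println(s.getMean() + " " + s.getSd()); // 3.0, sqrt(2.5)
    }
}
```

Per sample this is constant work with no allocation, which is why it is cheaper than a t-digest update, though as the comment says the difference is unlikely to be visible in practice.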
Re: LLR quick clarification
It just means that there is an association. Causation is much more difficult to ascertain. On Wed, May 4, 2016 at 6:06 AM, Nikaash Puri wrote: > Hi, > > Just wanted to clarify a small doubt. On running LLR with primary > indicator as view and secondary indicator as purchase. Say, one line of the > cross-cooccurrence matrix looks as follows: > > view-purchase cross-cooccurrence matrix: > > I1 I2:0.9, I3:0.8, …….. > … > > This, in very simple terms then means that purchasing I2 should lead to > the recommendation of viewing I1, is that correct? Of course, ignoring the > other indicators for now. > > Thank you, > Nikaash Puri
Re: About reuters-fkmeans-centroids
On Thu, Apr 28, 2016 at 10:54 AM, Prakash Poudyal wrote: > Actually, I need to use fuzzy clustering to cluster the sentences in my > research. I found the fuzzy k-means clustering algorithm in Apache Mahout, thus, I > am trying to use it for my purpose. > That's great. But that code is no longer supported.
Re: [jira] [Created] (MAHOUT-1771) Cluster dumper omits indices and 0 elements for dense vectors
On Tue, Sep 8, 2015 at 1:38 AM, Sean Owen (JIRA) wrote: > Sean Owen created MAHOUT-1771: > - > > Summary: Cluster dumper omits indices and 0 elements for > dense vectors > Key: MAHOUT-1771 > URL: https://issues.apache.org/jira/browse/MAHOUT-1771 > Project: Mahout > Issue Type: Bug > Components: Clustering, mrlegacy > Affects Versions: 0.9 > Reporter: Sean Owen > Priority: Minor > > > Blast from the past -- are patches still being accepted for "mrlegacy" > code? Something turned up incidentally when working with a customer that > looks like a minor bug in the cluster dumper code. > > In {{AbstractCluster.java}}: > > {code} > public static List<Object> formatVectorAsJson(Vector v, String[] bindings) > throws IOException { > > boolean hasBindings = bindings != null; > boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != > v.size(); > > // we assume sequential access in the output > Vector provider = v.isSequentialAccess() ? v : new > SequentialAccessSparseVector(v); > > List<Object> terms = new LinkedList<>(); > String term = ""; > > for (Element elem : provider.nonZeroes()) { > > if (hasBindings && bindings.length >= elem.index() + 1 && > bindings[elem.index()] != null) { > term = bindings[elem.index()]; > } else if (hasBindings || isSparse) { > term = String.valueOf(elem.index()); > } > > Map<String, Object> term_entry = new HashMap<>(); > double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000; > if (hasBindings || isSparse) { > term_entry.put(term, roundedWeight); > terms.add(term_entry); > } else { > terms.add(roundedWeight); > } > } > > return terms; > } > {code} > > Imagine a {{DenseVector}} with 5 elements, of which two are 0. It's > considered dense in this method since the number of non-default elements is > 5 (all elements are "non default" in a dense vector). > > However the iteration is over non-zero elements only. And indices are only > printed if it's sparse (or has bindings). So the result will be the 3 > non-zero elements printed without indices.
Which dimensions they are can't > be determined. > > The fix seems to be either: > - Compare number of _non-zero_ elements to the size when determining if > it's sparse > - Iterate over all elements if non-sparse > > I think the first is the intent? it would be a one-line change if so. > > {code} > boolean isSparse = !v.isDense() && v.getNumZeroElements() != v.size(); > {code} > > Pretty straightforward, and minor, but wanted to check with everyone > before making a change. > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332) >
Re: Announcements
Can you set up a list of twitter handles? On Wed, Aug 19, 2015 at 1:11 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Thanks; I'd say me, Suneel, Pat, Andrew P, Dmitriy, and Stevo could use it during release time. On Monday, August 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: Not sure if Ellen will see this email. I will forward. She is happy to share access to the Twitter account via Tweetdeck to anybody that the PMC designates. On Mon, Aug 17, 2015 at 4:05 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Could we send out announcements through the @ApacheMahout account on Twitter? Ellen, if you need some help with that account let us know; we can make it part of the release process if we all have access to the handle. Thanks!
Re: Announcements
Not sure if Ellen will see this email. I will forward. She is happy to share access to the Twitter account via Tweetdeck to anybody that the PMC designates. On Mon, Aug 17, 2015 at 4:05 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Could we send out announcements through the @ApacheMahout account on Twitter? Ellen, if you need some help with that account let us know; we can make it part of the release process if we all have access to the handle. Thanks!
Re: July Board Report
On Sun, Jul 5, 2015 at 11:48 AM, Suneel Marthi smar...@apache.org wrote: "Off late" Minor typo: this should be "Of late".
Re: July Board Report
On Sat, Jul 4, 2015 at 12:03 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: (1) does Samsara as code name require trademark research, legally? If so, was any research done? (I am guessing not -- not thru apache legal anyway). If Samsara were a project name, it would require research. Since it will be used with the qualifier Apache Mahout or Mahout in practice and since those already qualify, there should be no need for another search.
Re: July Board Report
I think that there should be some commentary added to the report that the project has lately had a problem with a substantial amount of off-list design discussions and that the PMC is aware of the problem and working to fix the problem. At least, I think that the PMC is aware of the problem and is working to fix it. On Sat, Jul 4, 2015 at 10:18 AM, Suneel Marthi smar...@apache.org wrote: Below is the draft of the July Board report, feedback welcome. - Report from the Apache Mahout project ## Description: The goal of the Apache Mahout project is to build an environment for quickly creating scalable distributed machine learning algorithms. ## Activity: - Apache Mahout’s next generation 0.10.0 was released on April 11, 2015. A new Math environment called ‘Samsara’ for its theme of universal rejuvenation was introduced in the 0.10.0 release. At Samsara’s core are general linear algebra and statistical operations with supporting data structures. Mahout-Samsara reflects a rethinking of how scalable Machine Learning algorithms are to be built and customized. - Apache Mahout 0.10.1 was released on May 31, 2015. This was a minor bug fix release following 0.10.0. - Apache Mahout now supports scalable Machine Learning on Spark, H2O and MapReduce. - The project has been working closely with Apache BigTop to integrate Apache Mahout into BigTop following a release. - Integration of Apache Mahout with Apache Flink is in the works and is being done in collaboration with Data Artisans and TU Berlin. - Ted Dunning and Suneel Marthi announced the new Mahout 0.10.0 with Spark and H2O support at BigData Everywhere (BDE) DC Conference at Tysons Corner, VA on May 13, 2015 - Anand Avati was added as a new committer. - Stevo Slavic was added as a PMC member. - Team presently working on 0.10.2 release, tentatively planned for the week of July 10 2015. ## Issues: - None ## PMC/Committership changes: - Currently 25 committers and 14 PMC members in the project.
- Stevo Slavić was added to the PMC on Fri May 08 2015 - Anand Avati was added as a committer on Thu Apr 23 2015 ## Releases: - 0.10.1 was released on Sun May 31 2015 - 0.10.0 was released on Sat Apr 11 2015 ## Mailing list activity: - dev@mahout.apache.org: - 977 subscribers (down -8 in the last 3 months): - 1324 emails sent to list (1419 in previous quarter) - u...@mahout.apache.org: - 1933 subscribers (down -10 in the last 3 months): - 243 emails sent to list (252 in previous quarter) - gene...@mahout.apache.org: - 10 subscribers (up 0 in the last 3 months): - 0 emails sent to list (0 in previous quarter) ## JIRA activity: - 85 JIRA tickets created in the last 3 months - 74 JIRA tickets closed/resolved in the last 3 months
[jira] [Commented] (MAHOUT-1746) Fix: mxA ^ 2, mxA ^ 0.5 to mean the same thing as mxA * mxA and mxA ::= sqrt _
[ https://issues.apache.org/jira/browse/MAHOUT-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599735#comment-14599735 ] Ted Dunning commented on MAHOUT-1746: - I think that this is more complicated than it looks. I just wrote a test and got really strange results. The rate at which x*x != Math.pow(x,2) is not constant in my test and seems like there may be strange interactions with the JIT. Fix: mxA ^ 2, mxA ^ 0.5 to mean the same thing as mxA * mxA and mxA ::= sqrt _ -- Key: MAHOUT-1746 URL: https://issues.apache.org/jira/browse/MAHOUT-1746 Project: Mahout Issue Type: Blog - New Blog Request Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 0.10.2 it so happens that in java, if x is of double type, Math.pow(x,2.0) and x * x produce different values approximately once in million random values. This is extremely annoying as it creates rounding errors, especially with things like euclidean distance computations, which eventually may produce occasional NaNs. This issue suggests to get special treatment on vector and matrix dsl to make sure identical fpu algorithms are running as follows: x ^ 2 = x * x x ^ 0.5 = sqrt(x) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
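A harness in the spirit of the test Ted describes, counting bitwise disagreements between x * x and Math.pow(x, 2.0) over random doubles (the mismatch rate, if any, is JVM- and JIT-dependent, which is exactly the strangeness noted above, so no fixed rate is asserted):

```java
import java.util.Random;

// Counts values where multiplication and Math.pow give different bit
// patterns; Math.pow is only specified to within 1 ulp of the exact
// result, so the two expressions are not required to agree.
public class PowVsMul {
  public static long countMismatches(long trials, long seed) {
    Random r = new Random(seed);
    long mismatches = 0;
    for (long i = 0; i < trials; i++) {
      double x = r.nextDouble() * 1000.0;
      if (x * x != Math.pow(x, 2.0)) mismatches++;
    }
    return mismatches;
  }

  public static void main(String[] args) {
    long trials = 1_000_000;
    System.out.println(countMismatches(trials, 42)
        + " mismatches in " + trials + " trials");
  }
}
```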
Re: RDMA on apache mahout
Pejman, Not sure quite what you are asking. How is implementing RDMA in Mahout different than adding RDMA to Spark or H2O (the backends that Mahout uses)? On Tue, Jun 23, 2015 at 4:54 AM, Pejman Hosseini pejman.invincibl...@gmail.com wrote: Hello everybody! I want to implement RDMA on Mahout as a part of my Thesis inspired by Accelerating Big Data Processing with Hadoop, Spark, and Memcached on Datacenters with Modern Architectures http://www.cse.ohio-state.edu/%7Epanda/isca15_bigdata.pdf. Unfortunately I can't find any papers or references that explain or implement it. I wanted to know whether it is possible at all? -- *Seyyed Pejman Hosseini pejman.invincibl...@gmail.com*
Re: JIRA's with no commits
On Thu, Jun 18, 2015 at 7:08 AM, Suneel Marthi smar...@apache.org wrote: Agreed. We have been keeping all project and design discussions to dev@ mailing lists and that's still the case. I just took a look at Slack and there is a long conversation on general about the trade-offs of matrix algorithms. Then another about the benefits or costs of multi-backend architecture. These are not discussions about release coordination. They are design discussions.
Re: JIRA's with no commits
Slack isn't the mailing list. On Wed, Jun 17, 2015 at 11:43 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: We talked about it a lot, some on Slack; it was work finally approved for donation. I reviewed it, looked great. On Wednesday, June 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: 5k lines in a single commit? No discussion on the list? On Wed, Jun 17, 2015 at 11:26 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Sounds like part of PR 135 which is Dmitriy's 5k-line-diff drop from the other week. On Wednesday, June 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: A lot of JIRA's are being opened and then closed with no apparent commits associated with them. For example MAHOUT-1725 adds an element-wise power operation but it was closed as fixed with no apparent discussion and with no commits attached to the JIRA. What is happening?
Re: JIRA's with no commits
5k lines in a single commit? No discussion on the list? On Wed, Jun 17, 2015 at 11:26 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Sounds like part of PR 135 which is Dmitriy's 5k-line-diff drop from the other week. On Wednesday, June 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: A lot of JIRA's are being opened and then closed with no apparent commits associated with them. For example MAHOUT-1725 adds an element-wise power operation but it was closed as fixed with no apparent discussion and with no commits attached to the JIRA. What is happening?
Re: JIRA's with no commits
On Thu, Jun 18, 2015 at 12:36 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Capturing discussion in a public format and archiving the discussion would be preferable to fragmenting across lists, PR comments, and Slack, but the tools are all valuable, and until we find a way to build a digest for the archives I support using them all. Actually, capturing the design discussion on the list is not just preferable. It is required. Using alternative tools is fine and all, but not if it compromises that core requirement.
JIRA's with no commits
A lot of JIRA's are being opened and then closed with no apparent commits associated with them. For example MAHOUT-1725 adds an element-wise power operation but it was closed as fixed with no apparent discussion and with no commits attached to the JIRA. What is happening?
[jira] [Commented] (MAHOUT-1699) Trim down Mahout packaging for next release
[ https://issues.apache.org/jira/browse/MAHOUT-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533567#comment-14533567 ] Ted Dunning commented on MAHOUT-1699: - How many of these dependencies should actually just be put into provided scope and thus excluded from the jar entirely? Trim down Mahout packaging for next release --- Key: MAHOUT-1699 URL: https://issues.apache.org/jira/browse/MAHOUT-1699 Project: Mahout Issue Type: Improvement Components: build Affects Versions: 0.10.0 Reporter: Suneel Marthi Priority: Critical Fix For: 0.10.1 Mahout 0.10.0 package size is 210MB, this needs to be trimmed down to a more manageable size. This also makes it hard to package Mahout into the BigTop distro and not to mention seeking an infra waiver at the time of release for the 200MB size. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
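As a hedged illustration of Ted's suggestion (artifact and version chosen for example only), a `provided`-scope dependency is compiled against but left out of the shipped jar, on the assumption that the runtime environment supplies it:

```xml
<!-- Illustrative only: "provided" scope keeps a jar that the cluster
     already ships (e.g. Hadoop) out of the Mahout packaging. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.6.0</version>
  <scope>provided</scope>
</dependency>
```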
Re: Just noticed that web sites can be git based
There is also a proposal afoot to withdraw some of the CMS service. The pubsub service that publishes the html would remain. On Wed, May 6, 2015 at 3:40 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: The markup and publish process is what I wonder about; the current CMS may be klunky but it does work and provide staging and checkpointing. On Wednesday, May 6, 2015, Pat Ferrel p...@occamsmachete.com wrote: https://docs.prediction.io/resources/intellij/ Notice the blue edit button, bottom right. All it does is take you to the page on github but hitting edit there leads you through editing and creates the correct PR to their “livedocs” branch. No idea what their publish process is, but with a PR it seems like we can do a merge to the ASF git repo and get it published through the ASF process. On May 5, 2015, at 10:25 AM, Ted Dunning ted.dunn...@gmail.com wrote: Can you give a pointer to such an icon? On Tue, May 5, 2015 at 6:16 PM, Pat Ferrel p...@occamsmachete.com wrote: I asked to sign us up when this was first announced but haven’t heard back. On another project I hit an “edit” icon on their site, which automatically sent me to the page on github, where I was allowed to edit. This automatically created a branch in my repo and a pr to the correct branch of their repo. Very convenient. That way an edit icon can be put on every Mahout CMS page and users will find requesting some rewording quite easy. Notice that no write access is required since edits go through a PR. Not sure if the ASF implementation does this, but would be nice. On May 3, 2015, at 9:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: https://blogs.apache.org/infra/entry/git_based_websites_available This might be nice to get rid of the svn step in web site updates. It would involve an alternative workflow for updates rather than the CMS process.
Re: Just noticed that web sites can be git based
Can you give a pointer to such an icon? On Tue, May 5, 2015 at 6:16 PM, Pat Ferrel p...@occamsmachete.com wrote: I asked to sign us up when this was first announced but haven’t heard back. On another project I hit an “edit” icon on their site, which automatically sent me to the page on github, where I was allowed to edit. This automatically created a branch in my repo and a pr to the correct branch of their repo. Very convenient. That way an edit icon can be put on every Mahout CMS page and users will find requesting some rewording quite easy. Notice that no write access is required since edits go through a PR. Not sure if the ASF implementation does this, but would be nice. On May 3, 2015, at 9:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: https://blogs.apache.org/infra/entry/git_based_websites_available This might be nice to get rid of the svn step in web site updates. It would involve an alternative workflow for updates rather than the CMS process.
Just noticed that web sites can be git based
https://blogs.apache.org/infra/entry/git_based_websites_available This might be nice to get rid of the svn step in web site updates. It would involve an alternative workflow for updates rather than the CMS process.
Re: dependency-reduced jar
The support commitment for t-digest either via stream-lib or directly from the t-digest jar is the same. I support it. Stream-lib is a bit behind because they don't update the dependency as often. Otherwise, it is exactly the same software and exactly the same support. On Sat, May 2, 2015 at 2:41 PM, Andrew Palumbo ap@outlook.com wrote: On 05/02/2015 10:48 AM, Pat Ferrel wrote: Not removing Guava or any other dependencies from the jar. I don’t have time right now to fix all those Preconditions that might allow Guava to be removed and the other classes are needed by various Spark client code. +1 to dealing with the Guava precondition and assembly stuff in another issue. Again, I propose we factor this into client and worker jars. Removing Preconditions may allow us to do away with the Worker jar altogether since guava is not used in Scala now. On May 1, 2015, at 2:18 PM, Pat Ferrel p...@occamsmachete.com wrote: removing guava shows up a bunch of uses of google Preconditions in math. Guess I’ll have to remove those. I’ll leave mr and the rest alone since only math code gets run on a spark worker. On May 1, 2015, at 10:01 AM, Andrew Palumbo ap@outlook.com wrote: ResultAnalyzer is also used in SparkNaiveBayes.test (...). Sent from my Verizon Wireless 4G LTE smartphone -------- Original message -------- From: Andrew Palumbo ap@outlook.com Date: 05/01/2015 12:57 PM (GMT-05:00) To: dev@mahout.apache.org Subject: RE: dependency-reduced jar I added T-digest and math3. the CLI Naive Bayes driver needs them. Specifically the ResultAnalyzer in TestNBDriver. Sent from my Verizon Wireless 4G LTE smartphone -------- Original message -------- From: Suneel Marthi suneel.mar...@gmail.com Date: 05/01/2015 12:14 PM (GMT-05:00) To: mahout dev@mahout.apache.org Subject: Re: dependency-reduced jar T-digest is being used in Mahout-MR, I believe its also packaged as part of Spark - AddThis jar.
On Fri, May 1, 2015 at 12:11 PM, Pat Ferrel p...@occamsmachete.com wrote: There is an assembly xml in mahout/spark/src/main/assembly/dependency-reduced.xml. It contains dependencies that are external to mahout but required for either the client or backend executor distributed code. Guava has recently been removed but scopt is still used by the client. For some reason the following artifacts were added to the assembly and I’m not sure why. This is only used with Spark.
Re: bringing back the fp-growth code in mahout
On Mon, Apr 27, 2015 at 8:13 PM, ray rtmel...@gmail.com wrote: What is the best way to tell if Apache code is being maintained, in particular the fp-growth algorithm in Spark's MLlib? Ask on the appropriate mailing list.
Re: bringing back the fp-growth code in mahout
Ray, Is the Spark implementation usable? Is it maintained? If not, there is a decent reason to move forward. I don't think that we want to revive the old map-reduce implementation. On Mon, Apr 27, 2015 at 5:48 AM, ray rtmel...@gmail.com wrote: I had it in mind to volunteer to maintain the fp-growth code in Mahout, but I see that Spark has an fp-growth implementation. So now that I have the time to work on this, I'm wondering if there is any point, or if there is still any interest in the Mahout community. If not, so be it. If so, I volunteer. Regards, Ray.
Re: [jira] [Created] (BIGTOP-1831) Upgrade Mahout to 0.10
Yeah... things have changed pretty radically. There is a whole bunch of new Scala based code. On Sun, Apr 26, 2015 at 11:22 AM, Konstantin Boudnik c...@apache.org wrote: Hey Andrew. I believe the upgrade from 0.9 to 0.10 on our side should be simple enough. Unless you guys have changed the structure of the build, or the build system itself or something similarly drastic. Do you have any input on this? Thanks Cos P.S. Thanks for the slack channel - it might come in handy! On Fri, Apr 24, 2015 at 09:26PM, Andrew Musselman wrote: I'm not educated enough in what has to happen but we're happy to help. Are there things we need to do from the Mahout end or is it changing recipes and doing regressions of BigTop builds, etc., what else? On Friday, April 24, 2015, Konstantin Boudnik c...@apache.org wrote: I am trying to see if anyone is doing the accommodation of 0.10 into the coming 1.0 release. That's pretty much a release blocker at this point. I am not very much concerned about Spark compat, but if we're to take 0.10 into 1.0 it needs to work and be tested against 2.6.0 Hadoop. So, does anyone work on the patch or this JIRA? Cos On Fri, Apr 24, 2015 at 05:48PM, Andrew Musselman wrote: The spark 1.3 compat is in a near future release; what do you need from us to make 1.1 and 1.2 compat work? On Thursday, April 23, 2015, Konstantin Boudnik (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/BIGTOP-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510075#comment-14510075 ] Konstantin Boudnik commented on BIGTOP-1831: How is it going guys? Looks like this is one of the blockers for 1.0 as we can not use old 0.9 version. Appreciate the help! Thank you!
Upgrade Mahout to 0.10 -- Key: BIGTOP-1831 URL: https://issues.apache.org/jira/browse/BIGTOP-1831 Project: Bigtop Issue Type: Task Components: general Affects Versions: 0.8.0 Reporter: David Starina Priority: Blocker Labels: Mahout Fix For: 1.0.0 Need to upgrade Mahout to the latest 0.10 release (first Hadoop 2.x compatible release) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Streaming and incremental cooccurrence
Sounds about right. My guess is that memory is now large enough, especially on a cluster, that the cooccurrence will fit into memory quite often. Taking a large example of 10 million items and 10,000 cooccurrences each, there will be 100 billion cooccurrences to store which shouldn't take more than about half a TB of data if fully populated. This isn't that outrageous any more. With SSD's as backing store, even 100GB of RAM or less might well produce very nice results. Depending on incoming transaction rates, using spinning disk as a backing store might also work with small memory. Experiments are in order. On Fri, Apr 24, 2015 at 8:12 AM, Pat Ferrel p...@occamsmachete.com wrote: Ok, seems right. So now to data structures. The input frequency vectors need to be paired with each input interaction type and would be nice to have as something that can be copied very fast as they get updated. Random access would also be nice but iteration is not needed. Over time they will get larger as all items get interactions, users will get more actions and appear in more vectors (with multi-interaction data). Seems like hashmaps? The cooccurrence matrix is more of a question to me. It needs to be updatable at the row and column level, and random access for both row and column would be nice. It needs to be expandable. To keep it small the keys should be integers, not full blown ID strings. There will have to be one matrix per interaction type. It should be simple to update the Search Engine to either mirror the matrix or use it directly for index updates. Each indicator update should cause an index update. Putting aside speed and size issues this sounds like a NoSQL DB table that is cached in-memory. On Apr 23, 2015, at 3:04 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel p...@occamsmachete.com wrote: This seems to violate the random choice of interactions to cut but now that I think about it does a random choice really matter?
It hasn't ever mattered such that I could see. There is also some reason to claim that earliest is best if items are very focussed in time. Of course, the opposite argument also applies. That leaves us with empiricism where the results are not definitive. So I don't think that it matters, but I can't prove that it doesn't.
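The arithmetic behind the half-a-TB estimate above, with the per-entry size written out as an explicit assumption (roughly a packed integer key plus a small count):

```java
// Back-of-envelope sizing for a fully populated cooccurrence store.
public class SizingSketch {
  public static void main(String[] args) {
    long items = 10_000_000L;             // 10 million items
    long cooccurrencesPerItem = 10_000L;  // 10,000 each
    long entries = items * cooccurrencesPerItem;  // 100 billion entries
    long bytesPerEntry = 5;               // assumed: packed key + count
    double tb = entries * (double) bytesPerEntry / 1e12;
    System.out.printf("%d entries, ~%.1f TB fully populated%n", entries, tb);
  }
}
```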
Re: Streaming and incremental cooccurrence
On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel p...@occamsmachete.com wrote: This seems to violate the random choice of interactions to cut but now that I think about it does a random choice really matter? It hasn't ever mattered such that I could see. There is also some reason to claim that earliest is best if items are very focussed in time. Of course, the opposite argument also applies. That leaves us with empiricism where the results are not definitive. So I don't think that it matters, but I can't prove that it doesn't.
Re: Streaming and incremental cooccurrence
On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel p...@occamsmachete.com wrote: I think we have been talking about an idea that does an incremental approximation, then a refresh every so often to remove any approximation so in an ideal world we need both. Actually, the method I was pushing is exact. If the sampling is made deterministic using clever seeds, then deletion is even possible since you can determine whether an observation was thrown away rather than used to increment counts. The only creeping crud aspect of this is the accumulation of zero rows as things fall out of the accumulation window. I would be tempted to not allow deletion and just restart as Pat is suggesting.
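The deterministic sampling Ted describes can be sketched like this (names and hash mix are illustrative, not Mahout code): because keep/drop is a pure function of (user, item, seed), a later deletion can recompute whether the observation had been counted and decrement exactly.

```java
// Deterministic keep/drop decision: same inputs, same answer, forever.
public class DeterministicSampler {
  private final long seed;
  private final double keepProbability;

  public DeterministicSampler(long seed, double keepProbability) {
    this.seed = seed;
    this.keepProbability = keepProbability;
  }

  public boolean keep(String user, String item) {
    long h = seed;
    h = 31 * h + user.hashCode();
    h = 31 * h + item.hashCode();
    // finalizer borrowed from murmur-style mixing
    h ^= h >>> 33;
    h *= 0xff51afd7ed558ccdL;
    h ^= h >>> 33;
    double u = (h >>> 11) / (double) (1L << 53); // in [0, 1)
    return u < keepProbability;
  }

  public static void main(String[] args) {
    DeterministicSampler s = new DeterministicSampler(42L, 0.5);
    // the decision for a given interaction never changes
    System.out.println(s.keep("u1", "i1") == s.keep("u1", "i1"));
  }
}
```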
Re: Streaming and incremental cooccurrence
Inline On Sun, Apr 19, 2015 at 11:05 AM, Pat Ferrel p...@occamsmachete.com wrote: Short answer, you are correct this is not a new filter. The Hadoop MapReduce implements: * maxSimilaritiesPerItem * maxPrefs * minPrefsPerUser * threshold Scala version: * maxSimilaritiesPerItem I think of this as column-wise, but that may be bad terminology. * maxPrefs And I think of this as row-wise or user limit. I think it is the interaction-cut from the paper. The paper talks about an interaction-cut, and describes it with “There is no significant decrease in the error for incorporating more interactions from the ‘power users’ after that.” While I’d trust your reading better than mine I thought that meant downsampling overactive users. I agree. However both the Hadoop Mapreduce and the Scala version downsample both user and item interactions by maxPrefs. So you are correct, not a new thing. The paper also talks about the threshold and we’ve talked on the list about how better to implement that. A fixed number is not very useful so a number of sigmas was proposed but is not yet implemented. I think that both minPrefsPerUser and threshold have limited utility in the current code. Could be wrong about that. With low quality association measures that suffer from low count problems or simplistic user-based methods, minPrefsPerUser can be crucial. Threshold can also be required for systems like that. The Scala code doesn't have that problem since it doesn't support those metrics.
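The "number of sigmas" proposal mentioned above could look like the following sketch (illustrative only, not implemented in Mahout at the time of this thread): keep an indicator only if its LLR score clears the mean plus k standard deviations of the scores being compared.

```java
// Adaptive threshold: mean + sigmas * sd over a set of LLR scores.
public class SigmaThreshold {
  public static double threshold(double[] scores, double sigmas) {
    double mean = 0;
    for (double s : scores) mean += s;
    mean /= scores.length;
    double variance = 0;
    for (double s : scores) variance += (s - mean) * (s - mean);
    variance /= scores.length;
    return mean + sigmas * Math.sqrt(variance);
  }

  public static void main(String[] args) {
    double[] scores = {0.1, 0.2, 0.1, 12.0}; // one strong indicator
    System.out.println("keep scores above " + threshold(scores, 1.0));
  }
}
```

Unlike a fixed number, this adapts to each indicator type's score distribution, which is the point of the proposal.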
Re: Streaming and incremental cooccurrence
Andrew Take a look at the slides I posted. In them I showed that the update does not grow beyond a very reasonable bound. Sent from my iPhone On Apr 18, 2015, at 9:15, Andrew Musselman andrew.mussel...@gmail.com wrote: Yes that's what I mean; if the number of updates gets too big it probably would be unmanageable though. This approach worked well with daily updates, but never tried it with anything real time. On Saturday, April 18, 2015, Pat Ferrel p...@occamsmachete.com wrote: I think you are saying that instead of val newHashMap = lastHashMap ++ updateHashMap, layered updates might be useful since new and last are potentially large. Some limit of updates might trigger a refresh. This might work if the update works with incremental index updates in the search engine. Given practical considerations the updates will be numerous and nearly empty. On Apr 17, 2015, at 7:58 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: I have not implemented it for recommendations but a layered cache/sieve structure could be useful. That is, between batch refreshes you can keep tacking on new updates in a cascading order so values that are updated exist in the newest layer but otherwise the lookup goes for the latest updated layer. You can put a fractional multiplier on older layers for aging but again I've not implemented it. On Friday, April 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: Yes. Also add the fact that the nano batches are bounded tightly in size both max and mean. And mostly filtered away anyway. Aging is an open question. I have never seen any effect of alternative sampling so I would just assume keep oldest which just tosses more samples. Then occasionally rebuild from batch if you really want aging to go right. Search updates any more are true realtime also so that works very well. Sent from my iPhone On Apr 17, 2015, at 17:20, Pat Ferrel p...@occamsmachete.com wrote: Thanks.
This idea is based on a micro-batch of interactions per update, not individual ones unless I missed something. That matches the typical input flow. Most interactions are filtered away by frequency and number of interaction cuts. A couple of practical issues: In practice won’t this require aging of interactions too? So wouldn’t the update require some old interaction removal? I suppose this might just take the form of added null interactions representing the geriatric ones? Haven’t gone through the math with enough detail to see if you’ve already accounted for this. To use actual math (self-join, etc.) we still need to alter the geometry of the interactions to have the same row rank as the adjusted total. In other words the number of rows in all resulting interactions must be the same. Over time this means completely removing rows and columns or allowing empty rows in potentially all input matrices. Might not be too bad to accumulate gaps in rows and columns. Not sure if it would have a practical impact (to some large limit) as long as it was done, to keep the real size more or less fixed. As to realtime, that would be under search engine control through incremental indexing and there are a couple ways to do that, not a problem afaik. As you point out the query always works and is real time. The index update must be frequent and not impact the engine's availability for queries. On Apr 17, 2015, at 2:46 PM, Ted Dunning ted.dunn...@gmail.com wrote: When I think of real-time adaptation of indicators, I think of this: http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel p...@occamsmachete.com wrote: I’ve been thinking about Streaming (continuous input) and incremental cooccurrence.
As interactions stream in from the user it is fairly simple to use something like Spark streaming to maintain a moving time window for all input, and an update frequency that recalcs all input currently in the time window. I’ve done this with the current cooccurrence code but though streaming, this is not incremental. The current data flow goes from interaction input to geometry and user dictionary reconciliation to A’A, A’B etc. After the multiply the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled. Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
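Andrew's layered cache/sieve idea from earlier in the thread can be sketched like this (an illustrative data structure, not anything in Mahout): lookups walk layers newest-first, and a batch refresh collapses all layers back into one.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Layered map: each micro-batch of updates goes into a fresh layer;
// reads see the newest value without rewriting older layers.
public class LayeredMap<K, V> {
  private final Deque<Map<K, V>> layers = new ArrayDeque<>();

  public LayeredMap() { layers.push(new HashMap<>()); }

  // start a new layer for the next micro-batch of updates
  public void newLayer() { layers.push(new HashMap<>()); }

  public void put(K key, V value) { layers.peek().put(key, value); }

  // newest layer wins
  public V get(K key) {
    for (Map<K, V> layer : layers) {
      V v = layer.get(key);
      if (v != null) return v;
    }
    return null;
  }

  // batch refresh: merge everything into a single base layer
  public void compact() {
    Map<K, V> merged = new HashMap<>();
    Iterator<Map<K, V>> it = layers.descendingIterator(); // oldest first
    while (it.hasNext()) merged.putAll(it.next());
    layers.clear();
    layers.push(merged);
  }

  public static void main(String[] args) {
    LayeredMap<String, Integer> m = new LayeredMap<>();
    m.put("a", 1);
    m.newLayer();
    m.put("a", 2);
    System.out.println(m.get("a")); // newest layer wins
  }
}
```

The aging multiplier Andrew mentions is not shown; it would go in get() as a per-layer discount.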
Re: Structure-based a %*% b optimization results.
Sadly, no, since that was from a different job. But here are some references with snippets: This one indicates that things have changed dramatically even just from 2009: http://www.cs.cornell.edu/~bindel/class/cs6210-f12/notes/lec02.pdf This next is a web aside from a pretty good looking book [1] http://csapp.cs.cmu.edu/2e/waside/waside-blocking.pdf I would guess that Samsara's optimizer could well do blocking as well as the transpose transformations that Dmitriy is talking about. [1] http://csapp.cs.cmu.edu/ On Fri, Apr 17, 2015 at 10:24 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Ted you have any sample code snippets? On Friday, April 17, 2015, Ted Dunning ted.dunn...@gmail.com wrote: This does look good. One additional thought would be to do a standard multi-level blocking implementation of matrix times. In my experience this often makes orientation much less important. The basic reason is that dense times requires n^3 ops but only n^2 memory operations. By rearranging the loops you get reuse in registers and then reuse in L1 and L2. The win that you are getting now is due to cache lines being fully used rather than partially used and then lost before they are touched again. The last time I did this, there were only three important caching layers. Registers. Cache. Memory. There might be more now. Done well, this used to buy 10x speed. Might even buy more, especially with matrices that blow L2 or even L3. Sent from my iPhone On Apr 17, 2015, at 17:26, Dmitriy Lyubimov dlie...@gmail.com javascript:; wrote: Spent an hour on this today. What i am doing: simply reimplementing pairwise dot-product algorithm in stock dense matrix times(). However, equipping every matrix with structure flavor (i.e. dense(...) reports row-wise , and dense(...).t reports column wise, dense().t.t reports row-wise again, etc.) 
Next, I wrote a binary operator that switches on the combination of operand orientations and flips the misaligned operand(s) (if any) to match the speediest orientation, RW-CW. Here are results for 300x300 dense matrix pairs:
Ad %*% Bd: (107.125, 46.375)
Ad' %*% Bd: (206.475, 39.325)
Ad %*% Bd': (37.2, 42.65)
Ad' %*% Bd': (100.95, 38.025)
Ad'' %*% Bd'': (120.125, 43.3)
These results are for transpose combinations of the original 300x300 dense random matrices, averaged over 40 runs (so standard error should be well controlled), in ms. The first number is the stock times() application (i.e. what we'd do with the %*% operator now), and the second number is the time with the matrices rewritten into RW-CW orientation. For example, AB reorients B only, just like A''B''; AB' reorients nothing; and the worst case, A'B, reorients both (I also tried to run a sum of outer products for the A'B case without reorientation -- apparently L1 misses far outweigh the cost of reorientation; I got very bad results for the outer-product sum). As we can see, the stock times() version does pretty badly even for dense operands, for any orientation except the optimal one. Given that, I am inclined to just add orientation-driven structure optimization here and replace all stock calls with just the orientation adjustment. Of course I will need to extend this to the sparse and sparse-row matrix combinations (quite a few of those, I guess) and see what happens compared to the stock sparse multiplications. But even this seems like a big win to me (basically, just doing the reorientation optimization seems to give a 3x speedup on average in matrix-matrix multiplication in 3 cases out of 4, and a tie in 1 case).
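As a sketch of why the RW-CW case wins (an illustration under assumed storage layouts, not the actual patch): flipping B into column-major storage up front turns every c[i][j] into a dot product over two contiguous arrays, so cache lines are fully consumed instead of partially used and evicted:

```java
/** Illustrative sketch: multiply an m x n by an n x p dense matrix,
 *  re-orienting B first so the inner dot product walks both operands
 *  along contiguous rows (the RW-CW case in the timings above). */
public class OrientedTimes {
    static double[][] times(double[][] a, double[][] b) {
        int m = a.length, n = b.length, p = b[0].length;
        // Store B^T row-wise (i.e. B column-wise) so each c[i][j]
        // is a dot product of two contiguous arrays.
        double[][] bt = new double[p][n];
        for (int k = 0; k < n; k++)
            for (int j = 0; j < p; j++)
                bt[j][k] = b[k][j];
        double[][] c = new double[m][p];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < p; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++) s += a[i][k] * bt[j][k];
                c[i][j] = s;
            }
        return c;
    }
}
```

The O(n^2) reorientation pass is amortized against the O(n^3) multiply, which matches the observation above that L1 misses far outweigh the cost of reorienting.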
Re: Streaming and incremental cooccurrence
On Sat, Apr 18, 2015 at 11:29 AM, Pat Ferrel p...@occamsmachete.com wrote: You seem to be proposing a new cut by frequency of item interaction; is this correct? This is because the frequency is known before the multiply and LLR. I assume the #2 cut is left in place? Yes, but I didn't think it was new.
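For reference, the "multiply and LLR" step mentioned above reduces to the G² log-likelihood ratio on a 2x2 contingency table of counts. The sketch below is a hedged re-derivation in the spirit of Mahout's LogLikelihood helper, not a copy of the project's code:

```java
/** Sketch of the G^2 log-likelihood ratio on a 2x2 contingency table:
 *  k11 = cooccurrence count, k12/k21 = one-sided counts, k22 = the rest. */
public class Llr {
    static double xLogX(long x) { return x == 0 ? 0.0 : x * Math.log(x); }

    /** Unnormalized entropy of a set of counts: N * H(counts / N). */
    static double entropy(long... counts) {
        long sum = 0;
        double s = 0.0;
        for (long c : counts) { sum += c; s += xLogX(c); }
        return xLogX(sum) - s;
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }
}
```

The frequency cut being discussed is cheap precisely because it needs only the marginal counts (k11 + k12, etc.), which are known before the multiply, while the LLR weight needs k11 from the multiply itself.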
Re: Additional Travis-CI Capacity
It is a piece of cake for simple builds. It required setting up a config file that is seen by Travis CI on the GitHub repo. If you use a Maven build, this is dead simple. Here, for instance, is the entire config for t-digest from the .travis.yml file:
language: java
jdk:
  - oraclejdk7
  - openjdk7
I had to tell Travis to look at the project, but that was it. Much simpler than, say, Jenkins. Bound to be less flexible as well, but if it does what I want and is more reliable because of fewer corner cases, how bad can it be to lose flexibility that I wouldn't use? On Fri, Apr 17, 2015 at 3:28 AM, Andrew Musselman a...@apache.org wrote: We're asking ourselves the same thing on dev@mahout. On Thursday, April 16, 2015, Konstantin Boudnik c...@apache.org wrote: How much work is it to re-implement everything in the new platform? Anyone have any experience with it? Cos On Thu, Apr 16, 2015 at 05:20PM, Roman Shaposhnik wrote: Is this something that we may want to look at? Thanks, Roman. -- Forwarded message -- From: David Nalley da...@gnsa.us Date: Wed, Apr 15, 2015 at 3:33 PM Subject: Additional Travis-CI Capacity To: bui...@apache.org FYI: https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
Re: Streaming and incremental cooccurrence
When I think of real-time adaptation of indicators, I think of this: http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel p...@occamsmachete.com wrote: I’ve been thinking about streaming (continuous input) and incremental cooccurrence. As interactions stream in from the user it is fairly simple to use something like Spark Streaming to maintain a moving time window over all input, and an update frequency that recalculates all input currently in the time window. I’ve done this with the current cooccurrence code, but though streaming, it is not incremental. The current data flow goes from interaction input, to geometry and user-dictionary reconciliation, to A’A, A’B, etc. After the multiply, the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled. Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
Re: [VOTE] Add Travis-CI for Mahout
On Fri, Apr 17, 2015 at 6:32 PM, Pat Ferrel p...@occamsmachete.com wrote: Doesn’t Apache have some draconian requirement to control all bits of the project pipeline and workflow? No. Apache has a strict policy about *hosting* all of the bits that users of the software consume. That means the authoritative version history, and the released bits. Using outside tools, either automated or manual, is a fine thing.
Re: Structure-based a %*% b optimization results.
This does look good. One additional thought would be to do a standard multi-level blocking implementation of matrix times. In my experience this often makes orientation much less important. The basic reason is that dense times requires n^3 ops but only n^2 memory operations. By rearranging the loops you get reuse in registers and then reuse in L1 and L2. The win that you are getting now is due to cache lines being fully used, rather than partially used and then lost before they are touched again. The last time I did this, there were only three important caching layers: registers, cache, memory. There might be more now. Done well, this used to buy 10x speed. Might even buy more, especially with matrices that blow L2 or even L3. Sent from my iPhone On Apr 17, 2015, at 17:26, Dmitriy Lyubimov dlie...@gmail.com wrote: Spent an hour on this today. What I am doing: simply reimplementing the pairwise dot-product algorithm in the stock dense matrix times(), however equipping every matrix with a structure flavor (i.e. dense(...) reports row-wise, dense(...).t reports column-wise, dense().t.t reports row-wise again, etc.). Next, I wrote a binary operator that switches on the combination of operand orientations and flips the misaligned operand(s) (if any) to match the speediest orientation, RW-CW. Here are results for 300x300 dense matrix pairs:
Ad %*% Bd: (107.125, 46.375)
Ad' %*% Bd: (206.475, 39.325)
Ad %*% Bd': (37.2, 42.65)
Ad' %*% Bd': (100.95, 38.025)
Ad'' %*% Bd'': (120.125, 43.3)
These results are for transpose combinations of the original 300x300 dense random matrices, averaged over 40 runs (so standard error should be well controlled), in ms. The first number is the stock times() application (i.e. what we'd do with the %*% operator now), and the second number is the time with the matrices rewritten into RW-CW orientation.
For example, AB reorients B only, just like A''B''; AB' reorients nothing; and the worst case, A'B, reorients both (I also tried to run a sum of outer products for the A'B case without reorientation -- apparently L1 misses far outweigh the cost of reorientation; I got very bad results for the outer-product sum). As we can see, the stock times() version does pretty badly even for dense operands, for any orientation except the optimal one. Given that, I am inclined to just add orientation-driven structure optimization here and replace all stock calls with just the orientation adjustment. Of course I will need to extend this to the sparse and sparse-row matrix combinations (quite a few of those, I guess) and see what happens compared to the stock sparse multiplications. But even this seems like a big win to me (basically, just doing the reorientation optimization seems to give a 3x speedup on average in matrix-matrix multiplication in 3 cases out of 4, and a tie in 1 case).
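Ted's multi-level blocking suggestion can be sketched as a tiled triple loop. The block size and loop order here are illustrative assumptions to be tuned per cache level, not the proposed Mahout change:

```java
/** Hedged sketch of cache blocking (tiling) for dense matrix multiply:
 *  process BLOCK x BLOCK tiles so the working set stays cache-resident. */
public class BlockedTimes {
    static final int BLOCK = 32; // illustrative tile edge; tune to L1/L2 size

    static double[][] times(double[][] a, double[][] b) {
        int m = a.length, n = b.length, p = b[0].length;
        double[][] c = new double[m][p];
        for (int i0 = 0; i0 < m; i0 += BLOCK)
            for (int k0 = 0; k0 < n; k0 += BLOCK)
                for (int j0 = 0; j0 < p; j0 += BLOCK)
                    // Multiply one pair of tiles. The i-k-j order keeps
                    // a[i][k] in a register and streams c[i] and b[k] row-wise.
                    for (int i = i0; i < Math.min(i0 + BLOCK, m); i++)
                        for (int k = k0; k < Math.min(k0 + BLOCK, n); k++) {
                            double aik = a[i][k];
                            for (int j = j0; j < Math.min(j0 + BLOCK, p); j++)
                                c[i][j] += aik * b[k][j];
                        }
        return c;
    }
}
```

With the tiles sized so that three BLOCK x BLOCK panels fit in a cache level, each loaded cache line is reused ~BLOCK times before eviction, which is what makes orientation matter less.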
Re: [VOTE] Add Travis-CI for Mahout
I use it for t-digest and like it a lot. There are some strict bounds on how much resource you are supposed to consume. Mileage may vary. On Fri, Apr 17, 2015 at 12:23 AM, Suneel Marthi suneel.mar...@gmail.com wrote: Would this be an additional CI we would like to add to Mahout ? https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci I am up for it. +1
Re: Next version
A word of warning about making decisions off-list and without a permanent record on the mailing list. I will likely be available, but may not be. I am happy with whatever the consensus is (with a tilt towards frequent releases), but would like to see most of the decision process on the list. On Tue, Apr 14, 2015 at 4:44 AM, Suneel Marthi suneel.mar...@gmail.com wrote: We should talk about this. Could the team Slack tomorrow at 1 PM Eastern Time to talk this out and also finalize scope for the next one? On Mon, Apr 13, 2015 at 9:14 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I thought we wanted to do 0.10.1 with a quicker release cycle and bugfixes? On Sun, Apr 12, 2015 at 6:47 AM, Suneel Marthi suneel.mar...@gmail.com wrote: On Sun, Apr 12, 2015 at 8:56 AM, Stevo Slavić ssla...@gmail.com wrote: Hello team, Should the next version be 0.10.1 or 0.11.0? I am fine with just 0.11 Thinking maybe 0.11.0 is more suitable, if it's going to contain artifact name changes like MAHOUT-1680 and MAHOUT-1681, and fundamental new features, so we keep minor releases for backward-compatible bug-fix releases only. Btw, it would be good (whoever has privileges) to have the versions in the JIRA project sorted out: - mark 0.10.0 as released - remove the two empty 1.0-snapshot versions - move 1.0 to the top and clear its release date - move 0.10.1/0.11.0 under 1.0 and after 0.10.0 Stevo, you should have permissions now to fix all of the above. - maybe plan and set a 0.10.1/0.11.0 expected release date (Suneel was mentioning it would be nice to integrate with Apache Flink by October, in time for http://lanyrd.com/2015/flink-forward/ ) This would definitely be a good story to present at http://lanyrd.com/2015/flink-forward/ The Flink team is ready to dedicate resources from their camp to work with us. Kind regards, Stevo Slavic.
Re: Next version
On Tue, Apr 14, 2015 at 8:49 AM, Stevo Slavić ssla...@gmail.com wrote: I'm not sure, but I doubt there's anything in the Apache way of doing things that's preventing us from having both 0.10.1 and 0.11.0 releases planned and worked on in parallel with dedicated branches, e.g. master for the next major.minor/non-bug-fix release, and branches for bug-fix-supported versions like 0.10 or 0.10.x. One can create a 0.10.x branch from the 0.10.0 release tag. Changes there have to be regularly merged to master. This is entirely up to the project from the Apache viewpoint. (And speaking as a project member, it sounds like a good idea.)
Re: [VOTE] Apache Mahout 0.10.0 Release
Quick reminder for the next release: It is important that at least one set of eyes examine the licensing aspects of the release. This includes running RAT, making sure that bits and bobs are named accurately and that the NOTICE and LICENSE files are correct. We should have different people check different things next time. On Sat, Apr 11, 2015 at 11:25 AM, Suneel Marthi suneel.mar...@gmail.com wrote: Thanks everyone. We have had 5 +1 votes from the PMC, so this release has passed and the voting officially closes. Will send a formal release announcement once the release is finalized. Thanks again. On Sat, Apr 11, 2015 at 12:20 PM, Pat Ferrel p...@occamsmachete.com wrote: Just built an external app using sbt against the staging repo and it looks good to me +1 (binding) On Apr 11, 2015, at 9:12 AM, Andrew Palumbo ap@outlook.com wrote: After testing examples locally from the .tar and .zip distributions and testing the staged mahout-math artifact in a Java application, I am happy with this release. +1 (binding) On 04/11/2015 11:45 AM, Suneel Marthi wrote: After checking the {source} * {tar,zip} and running a few tests locally, I am fine with this release. +1 (binding) On Sat, Apr 11, 2015 at 11:43 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: After checking the binary tarball and zip, and running through all the examples on an EMR cluster, I am good with this release. +1 (binding) On Fri, Apr 10, 2015 at 9:34 PM, Ted Dunning ted.dunn...@gmail.com wrote: Ah... forgot this. +1 (binding) On Fri, Apr 10, 2015 at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: I downloaded and tested the signatures and check-sums on {binary,source} x {zip,tar} + pom. All were correct. One thing that I worry a little about is that the name of the artifact doesn't include apache. Not sure that is a hard requirement, but it seems a good thing to do.
On Fri, Apr 10, 2015 at 8:16 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Here's a new Mahout 0.10.0 Release Candidate at https://repository.apache.org/content/repositories/orgapachemahout-1007/ The voting for this ends tomorrow. Need at least 3 PMC +1s for the release to pass. Grant, Ted: Would appreciate it if you guys could verify the signatures. Rest: Please test the artifacts. Thanks to all the contributors and committers. Regards, Suneel On Fri, Apr 10, 2015 at 11:45 AM, Pat Ferrel p...@occamsmachete.com wrote: Ran well, but we have a packaging problem with the binary distro. Will require either a pom or code change, I think; hold the vote. On Apr 9, 2015, at 4:31 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Running on EMR now. On Thu, Apr 9, 2015 at 3:52 PM, Pat Ferrel p...@occamsmachete.com wrote: I can't run it (due to a messed-up dev machine), but I verified the artifacts by building an external app with sbt using the staged repo instead of my local .m2 cache. This means the Scala classes were resolved correctly from the artifacts. Hope someone can actually run it on a cluster. On Apr 9, 2015, at 2:42 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Please find the Mahout 0.10.0 release candidate at https://repository.apache.org/content/repositories/orgapachemahout-1005/ The voting runs till Saturday, April 11 2015; need at least 3 PMC +1 votes for the candidate release to pass. Thanks again to all the committers and contributors for their hard work over the past few weeks. Regards, Suneel On Behalf of Apache Mahout Team
Re: [VOTE] Apache Mahout 0.10.0 Release
I downloaded and tested the signatures and check-sums on {binary,source} x {zip,tar} + pom. All were correct. One thing that I worry a little about is that the name of the artifact doesn't include apache. Not sure that is a hard requirement, but it seems a good thing to do. On Fri, Apr 10, 2015 at 8:16 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Here's a new Mahout 0.10.0 Release Candidate at https://repository.apache.org/content/repositories/orgapachemahout-1007/ The voting for this ends tomorrow. Need at least 3 PMC +1s for the release to pass. Grant, Ted: Would appreciate it if you guys could verify the signatures. Rest: Please test the artifacts. Thanks to all the contributors and committers. Regards, Suneel On Fri, Apr 10, 2015 at 11:45 AM, Pat Ferrel p...@occamsmachete.com wrote: Ran well, but we have a packaging problem with the binary distro. Will require either a pom or code change, I think; hold the vote. On Apr 9, 2015, at 4:31 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Running on EMR now. On Thu, Apr 9, 2015 at 3:52 PM, Pat Ferrel p...@occamsmachete.com wrote: I can't run it (due to a messed-up dev machine), but I verified the artifacts by building an external app with sbt using the staged repo instead of my local .m2 cache. This means the Scala classes were resolved correctly from the artifacts. Hope someone can actually run it on a cluster. On Apr 9, 2015, at 2:42 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Please find the Mahout 0.10.0 release candidate at https://repository.apache.org/content/repositories/orgapachemahout-1005/ The voting runs till Saturday, April 11 2015; need at least 3 PMC +1 votes for the candidate release to pass. Thanks again to all the committers and contributors for their hard work over the past few weeks. Regards, Suneel On Behalf of Apache Mahout Team
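The checksum half of the verification described above amounts to recomputing the digest over the downloaded bytes and comparing it to the published one. A minimal sketch, with hypothetical class and method names (the real check is usually done with command-line tools against the actual staging-repo artifacts):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: verify a release artifact's published SHA-512 digest. */
public class ChecksumCheck {
    /** Hex-encode the SHA-512 digest of the given bytes. */
    static String sha512Hex(byte[] bytes) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(bytes)) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    /** True when the recomputed digest matches the published one. */
    static boolean verify(byte[] artifactBytes, String publishedHex) throws NoSuchAlgorithmException {
        return sha512Hex(artifactBytes).equalsIgnoreCase(publishedHex.trim());
    }
}
```

The signature check is separate: it needs the signer's public key and a tool such as GPG, and cannot be reduced to a digest comparison.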
Re: [VOTE] Apache Mahout 0.10.0 Release
Ah... forgot this. +1 (binding) On Fri, Apr 10, 2015 at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: I downloaded and tested the signatures and check-sums on {binary,source} x {zip,tar} + pom. All were correct. One thing that I worry a little about is that the name of the artifact doesn't include apache. Not sure that is a hard requirement, but it seems a good thing to do. On Fri, Apr 10, 2015 at 8:16 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Here's a new Mahout 0.10.0 Release Candidate at https://repository.apache.org/content/repositories/orgapachemahout-1007/ The voting for this ends tomorrow. Need at least 3 PMC +1s for the release to pass. Grant, Ted: Would appreciate it if you guys could verify the signatures. Rest: Please test the artifacts. Thanks to all the contributors and committers. Regards, Suneel On Fri, Apr 10, 2015 at 11:45 AM, Pat Ferrel p...@occamsmachete.com wrote: Ran well, but we have a packaging problem with the binary distro. Will require either a pom or code change, I think; hold the vote. On Apr 9, 2015, at 4:31 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Running on EMR now. On Thu, Apr 9, 2015 at 3:52 PM, Pat Ferrel p...@occamsmachete.com wrote: I can't run it (due to a messed-up dev machine), but I verified the artifacts by building an external app with sbt using the staged repo instead of my local .m2 cache. This means the Scala classes were resolved correctly from the artifacts. Hope someone can actually run it on a cluster. On Apr 9, 2015, at 2:42 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Please find the Mahout 0.10.0 release candidate at https://repository.apache.org/content/repositories/orgapachemahout-1005/ The voting runs till Saturday, April 11 2015; need at least 3 PMC +1 votes for the candidate release to pass. Thanks again to all the committers and contributors for their hard work over the past few weeks. Regards, Suneel On Behalf of Apache Mahout Team
Re: Professional services
Actually, I should change my line to: MapR Technologies | sa...@maprtech.com | Full commercial support On Fri, Apr 3, 2015 at 4:58 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Anyone else want their contact info on this page? Frank, what URL would you like to use; that one 404s.
[jira] [Reopened] (MAHOUT-1672) Update OnlineSummarizer to use the new T-Digest
[ https://issues.apache.org/jira/browse/MAHOUT-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reopened MAHOUT-1672: - If I get a 3.1 release out before Sunday, I would like to use that. No code changes will be required, just the pom. Update OnlineSummarizer to use the new T-Digest Key: MAHOUT-1672 URL: https://issues.apache.org/jira/browse/MAHOUT-1672 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.9 Reporter: Suneel Marthi Assignee: Suneel Marthi Priority: Trivial Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1668) Automate release process
[ https://issues.apache.org/jira/browse/MAHOUT-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393127#comment-14393127 ] Ted Dunning commented on MAHOUT-1668: - Signing cannot be done on shared hardware that you don't control. That still leaves a vat of stuff that can be done by the automated system, but you need some way for the release manager to verify that the bits in the release are exactly what is expected. Automate release process Key: MAHOUT-1668 URL: https://issues.apache.org/jira/browse/MAHOUT-1668 Project: Mahout Issue Type: Task Reporter: Stevo Slavic Assignee: Stevo Slavic Fix For: 0.10.0 -- 0.10.0 will be the first release since the project switched to git. Some changes have to be made in the build scripts to support the release process, the Apache way. As a consequence, the how-to-make-a-release docs will likely need to be updated as well. Also, it would be nice to automate the release process as much as possible, e.g. via dedicated Jenkins build job(s), so it's easy for any committer to cut a release for vote, and after the vote either finalize the release or easily make a new RC - this will enable us to release faster and more often. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: MapR repo might need to be updated
(moving dev@mahout to bcc since this is not of widespread interest) Stevo, Here is what our builds guy says: Our version of nexus is 2.3.1. The last update to the repo was Friday. Because the error listed a cookie issue, I restarted apache. I have two builds building right now and pulling from the repo, no issues, yet. Can you say if the problem persists? On Mon, Mar 30, 2015 at 2:34 PM, Stevo Slavić ssla...@gmail.com wrote: Hello Ted, MapR Maven repository manager, seems to be Nexus, and it seems to be version 2.11.1 or older with this bug still in it: https://issues.sonatype.org/browse/NEXUS-7877 Mahout build uses MapR Maven repository, and for all artifacts/dependencies resolved from it, build output is polluted with warnings like: Downloading: http://repository.mapr.com/maven/org/apache/apache/16/apache-16.pom Mar 30, 2015 11:20:48 PM org.apache.maven.wagon.providers.http.httpclient.client.protocol.ResponseProcessCookies processCookies WARNING: Cookie rejected [rememberMe=deleteMe, version:0, domain: repository.mapr.com, path:/nexus, expiry:Mon Mar 30 23:20:48 CEST 2015] Illegal path attribute /nexus. Path of origin: /maven/org/apache/apache/16/apache-16.pom Please consider having it updated. Kind regards, Stevo Slavic.
Re: Anyone using eclipse?
Idea here as well. On Mon, Mar 30, 2015 at 4:52 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Idea here On Mon, Mar 30, 2015 at 4:42 PM, Andrew Palumbo ap@outlook.com wrote: also using idea On 03/30/2015 07:18 PM, Dmitriy Lyubimov wrote: I switched to IDEA since I started doing mixed projects with Scala. Standalone Scala is bearable in Eclipse, but mixed projects simply don't work (and Mahout is likely one of them). On Mon, Mar 30, 2015 at 3:58 PM, Suneel Marthi suneel.mar...@gmail.com wrote: I believe it's only Shannon from amongst the committer team who is using Eclipse. I am trying to talk him into shifting to IntelliJ. On Mon, Mar 30, 2015 at 6:54 PM, Stevo Slavić ssla...@gmail.com wrote: Hello team, I'm curious, are any of you using the Eclipse IDE? If not, then as part of MAHOUT-1278 I could remove a lot from our POMs. Kind regards, Stevo Slavic.
Re: MapR repo might need to be updated
Thanks. On it. On Mon, Mar 30, 2015 at 2:34 PM, Stevo Slavić ssla...@gmail.com wrote: Hello Ted, MapR Maven repository manager, seems to be Nexus, and it seems to be version 2.11.1 or older with this bug still in it: https://issues.sonatype.org/browse/NEXUS-7877 Mahout build uses MapR Maven repository, and for all artifacts/dependencies resolved from it, build output is polluted with warnings like: Downloading: http://repository.mapr.com/maven/org/apache/apache/16/apache-16.pom Mar 30, 2015 11:20:48 PM org.apache.maven.wagon.providers.http.httpclient.client.protocol.ResponseProcessCookies processCookies WARNING: Cookie rejected [rememberMe=deleteMe, version:0, domain: repository.mapr.com, path:/nexus, expiry:Mon Mar 30 23:20:48 CEST 2015] Illegal path attribute /nexus. Path of origin: /maven/org/apache/apache/16/apache-16.pom Please consider having it updated. Kind regards, Stevo Slavic.
Re: Require Java 7 and Hadoop 2.x?
There are subtle API incompatibilities. Unfortunate. But true. On Fri, Mar 27, 2015 at 10:16 AM, Pat Ferrel p...@occamsmachete.com wrote: As I said in the other thread, forcing Java 7 is not as big a deal as forcing Hadoop 1.2.1. Is there some new part of 2.X that we need? Or some forced API incompatibility? On Mar 27, 2015, at 9:58 AM, Suneel Marthi suneel.mar...@gmail.com wrote: TED??? please jump in. On Fri, Mar 27, 2015 at 12:54 PM, Pat Ferrel p...@occamsmachete.com wrote: Aren’t current Mahout 0.9 users on Hadoop 1.2.1 by definition? Probably most on Java 6 too. Unless there is some strong reason, it seems like we should support both of those for at least one release, shouldn’t we? I have a Hadoop 1.2.1 cluster, which has a Hadoop job that is not Hadoop 2 compatible, so I’m stuck there for the time being. Compiling Mahout for this now gives an error over the H2 API “isDirectory”, which I think used to be “isDir” for H1. Has that API been deprecated in H2? Are we forced to choose either/or? On Mar 27, 2015, at 9:31 AM, Pat Ferrel p...@occamsmachete.com wrote: It should; Hadoop supports it long term and lots of people are stuck there with projects that haven’t been upgraded (Mahout comes to mind). On Mar 27, 2015, at 9:26 AM, Stevo Slavić ssla...@gmail.com wrote: Have to check, but I doubt that the build supports hadoop 1.x any more. On Fri, Mar 27, 2015 at 5:15 PM, Suneel Marthi suneel.mar...@gmail.com wrote: This is the Java version, gotta use Java 7 On Fri, Mar 27, 2015 at 12:08 PM, Pat Ferrel p...@occamsmachete.com wrote: Latest source for Spark 1.1.0 and Hadoop 1.2.1. Build complains about the move to <maven.compiler.target>1.7</maven.compiler.target> I think this was upped from 1.6, but I'm not sure if that’s what the error is about. I’m on Java 6 on this machine, if that matters. Actual error: [INFO] Mahout Build Tools SUCCESS [3.512s] [INFO] Apache Mahout . SUCCESS [0.603s] [INFO] Mahout Math ... FAILURE [6.453s] [INFO] Mahout MapReduce Legacy ...
SKIPPED
[INFO] Mahout Integration ... SKIPPED
[INFO] Mahout Examples ... SKIPPED
[INFO] Mahout Release Package ... SKIPPED
[INFO] Mahout Math Scala bindings ... SKIPPED
[INFO] Mahout Spark bindings ... SKIPPED
[INFO] Mahout Spark bindings shell ... SKIPPED
[INFO] Mahout H2O backend ... SKIPPED
[INFO] BUILD FAILURE
[INFO] Total time: 11.609s
[INFO] Finished at: Fri Mar 27 08:55:35 PDT 2015
[INFO] Final Memory: 24M/310M
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:compile (default-compile) on project mahout-math: Fatal error compiling: invalid target release: 1.7 - [Help 1]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
Re: Release
That is great news. Of course, Anand is doing that personally and doesn't actually work for h2o.ai (formerly 0xdata). It was the company that I meant. Apache contributors are individuals, of course, but having somebody be paid for building contributions definitely helps with avoiding distractions like finding groceries. On Tue, Mar 17, 2015 at 6:35 PM, Suneel Marthi suneel.mar...@gmail.com wrote: Who the heck said they have moved on? Anand confirmed just today that he would continue on the H2O Mahout integration. Sent from my iPhone On Mar 17, 2015, at 8:26 PM, todd rtmel...@gmail.com wrote: On 03/17/2015 12:49 PM, Ted Dunning wrote: I think it should be deprecated. The H2O guys have moved on after the reception they got. They've moved on? This has got to be one of the most disappointing things I have read in a long time.
Re: Release
On Tue, Mar 17, 2015 at 10:14 AM, Pat Ferrel p...@occamsmachete.com wrote: I’m nervous releasing H2O with no one supporting it. Is anyone signing up for that? I think it should be deprecated. The H2O guys have moved on after the reception they got.
Re: Mahout listed under Lucene category in Jira
This looks fixed. On Tue, Mar 17, 2015 at 10:36 AM, Dyer, James james.d...@ingramcontent.com wrote: Someone on the Lucene PMC noticed that Mahout JIRAs appear in our list at reporter.apache.org. We think this might be because Mahout is still listed under the Lucene category in Jira. ( https://issues.apache.org/jira/secure/BrowseProjects.jspa#10150). Is there an admin who can change the Mahout project's category in Jira? Thank you! James Dyer Ingram Content Group
Re: Neural network contribution
Burak, Sounds like a nice effort. Mahout is focused on implementations in Java and lately Scala, not C. There is another project, however, just entering incubation that might fit much better: the Singa project. The proposal is here: http://wiki.apache.org/incubator/SingaProposal I suggest that you contact Beng Chin Ooi (email on the proposal) to discuss what you have in more detail. The group that started the Singa project is very good on neural networks and should be able to comment better than we can. On Tue, Mar 10, 2015 at 4:55 AM, burak sarac bu...@linux.com wrote: Hello all, A few months ago I completed a small neural network application for study. I just met Mahout and I liked it! I was also looking for a neural network implementation to compare against mine and couldn't find any. If I am not wrong, is there any chance I can contribute my project? With the Andrew Ng samples, 5000 digits calculate in 400 ms on a single core, 40 ms on my 8-core machine, and 20 ms on GPU. (Each iteration mostly does 1 calculation.) I have used an Fmincg implementation in C for optimization. At least you could maybe do a code review for me? I will appreciate any feedback! Implementation is in C. There are also more features which I didn't commit yet (improved feature scaling, using different hidden-layer sizes per layer, etc.). Project here: https://github.com/buraksarac/NeuralNetwork Main logic here: https://github.com/buraksarac/NeuralNetwork/blob/master/src/NeuralNetwork.cpp Thank you for your time! p.s. I tried to send a few emails; I hope I didn't flood. Burak Sarac
Re: What is Mahout?
+1 for keeping the name -1 for incubation On Thu, Feb 26, 2015 at 5:24 AM, Pat Ferrel p...@occamsmachete.com wrote: Along with workspaces and code completion, +1 for visualization and extended (Bayesian, stats, etc.) ops. Anything that is scalable and general seems fair game. Also -1 for incubation. This is all an evolution of loosely collected algos into generalizations and extensions of legacy stuff on new ground. Also +1 for separating out packages more formally, like spark-itemsimilarity and other things that aren’t general. They may come with generalized bits (like similarity) but have package-like delivery mechanisms. We should be able to have something better than contrib, especially since these may come with math and core extensions that are generally useful. No need to separate that until the core is done. However, a new identity would be a big boost to being able to communicate the new mission, and it is a new mission. If the issue is about support for legacy, that doesn’t seem to be a problem. If we stay a top-level project we can support legacy; in fact we have to. On Feb 25, 2015, at 6:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: -1 on incubation as well. The website and docs and user lists and this champion and mentor stuff, and logos and promotions for committers, absolutely do not make any sense at this point. From what I hear, people are pretty busy without having that as it is. It would probably make more sense to take both Andrews :) and the committers who actively pursue the programming-environment vision to the PMC, and for people who feel that they have no valuable input for the new philosophy of the project to just go emeritus and give up their voting rights. Power of do, as they say. There's no major change in philosophy either -- Mahout has been proclaiming scalable machine learning, which is what we will continue doing. Only doing it (hopefully) a bit easier and with a new set of backend tools.
I want to emphasize that I'd seek math environment status in a more general sense: not just algebraic, but also connect this to stats, samplers, optimizers (including bayesian opts), feature extractors, i.e. all the basic big-ML tools. Adapt Spark's DataFrame to these tools where appropriate. Viewing it as solely distributed algebra is a bit skewed away from reality. On private branches, I have previously developed a lot of that functionality (except for the visual stuff) and it is in practice very useful; it creates a common umbrella for people with an R background. I would very much want to integrate something for visualization, as it is important for an environment. Unfortunately, I don't see any mature science plotting for the JVM around. Scatter plots at best. I want at least to be able to plot 2d maps and KDEs with contours or density levels. There are ways to visualize massive datasets (and their parts). I see no tools for this around at all. Maybe some clever way to integrate with ggplot2 or a shiny server? Even that, even if it required 3rd-party software installation, would've been better than nothing at all. I don't expect methodologies to go to contrib, actually. Slightly different modules, maybe, but not so extreme as contrib. On Wed, Feb 25, 2015 at 5:18 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: How much would be involved in changing the name of a top-level project? I'd prefer to avoid the overhead of going back into incubation. I agree 0.10 makes more sense. On Wed, Feb 25, 2015 at 12:16 PM, Sean Owen sro...@gmail.com wrote: My $0.02: There is no shortage of algorithm libraries that are in some way runnable on Hadoop out there, and not as much easy-to-use distributed matrix operation libraries. I think it's more additive to the ecosystem to solve that narrow, and deep, linear algebra problem and really nail it. That's a pretty good 'identity' to claim. It seems like an appropriate scope.
I do think the project has changed so much that it's more confusing to keep calling it Mahout than to change the name. I can't think of one person I've talked to about Mahout in the last 6 months who was not under the impression that what is in 0.9 has simply been ported to Spark. It's different enough that it could even be its own incubator project (under a different name). The brand recognition is for the deprecated part, so keeping that is almost the problem. It's not crazy to just change the name. Or even consider a re-incubation. It might give some latitude to more fully reboot. Releasing 1.0.0 on the other hand means committing to the APIs (and name) for some fairly new code and fairly soon. Given that this is sort of a 0.1 of a new project, going to 1.0 feels semantically wrong. But a release would be good. Personally I'd suggest 0.10. On Wed, Feb 25, 2015 at 5:50 PM, Pat Ferrel p...@occamsmachete.com wrote: Looking back over the last year Mahout has gone through a lot
Re: PMML
PMML is a machine-to-machine mechanism, not intended really for human consumption or production. Based on XML, it is, of course, bloated, but that doesn't really matter for readability since reading isn't the goal. The vision of making models easy to transfer from system to system is nice, but the reality has fallen far short, unfortunately. The problem is that systems often have special aspects that make it hard to replicate exact actions from one system to another. Having a textual format for numerical data doesn't help. Here, for instance, is a linear regression model that I created using R:

<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
  <Header copyright="Copyright (c) 2015 tdunning" description="Linear Regression Model">
    <Extension name="user" value="tdunning" extender="Rattle/PMML"/>
    <Application name="Rattle/PMML" version="1.4"/>
    <Timestamp>2015-03-05 09:46:32</Timestamp>
  </Header>
  <DataDictionary numberOfFields="4">
    <DataField name="y" optype="continuous" dataType="double"/>
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="x3" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="Linear_Regression_Model" functionName="regression" algorithmName="least squares">
    <MiningSchema>
      <MiningField name="y" usageType="predicted"/>
      <MiningField name="x1" usageType="active"/>
      <MiningField name="x2" usageType="active"/>
      <MiningField name="x3" usageType="active"/>
    </MiningSchema>
    <Output>
      <OutputField name="Predicted_y" feature="predictedValue"/>
    </Output>
    <RegressionTable intercept="-0.000669089797102863">
      <NumericPredictor name="x1" exponent="1" coefficient="3.00018785681213"/>
      <NumericPredictor name="x2" exponent="1" coefficient="-1.00362806356329"/>
      <NumericPredictor name="x3" exponent="1" coefficient="0.998224481877296"/>
    </RegressionTable>
  </RegressionModel>
</PMML>

This looks pretty reasonable (if verbose).
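For what the XML above actually encodes, the RegressionTable is just a linear formula. Here is a minimal Python sketch of how a consumer might score it; the coefficient values are taken from the model above, while the helper names are invented for illustration:

```python
# Hypothetical scorer for the RegressionTable above; only the numeric
# values come from the PMML snippet, everything else is illustrative.
intercept = -0.000669089797102863
coefficients = {"x1": 3.00018785681213,
                "x2": -1.00362806356329,
                "x3": 0.998224481877296}

def predict(row):
    """Score one input row: intercept + sum(coefficient * value)."""
    return intercept + sum(c * row[name] for name, c in coefficients.items())

print(round(predict({"x1": 1.0, "x2": 1.0, "x3": 1.0}), 3))  # prints 2.994
```

The model was evidently fit to approximate y = 3*x1 - x2 + x3, which is what the near-integer coefficients suggest.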
It takes 1.5 kB to store a model, but this compresses to around 600 bytes. More involved models are a different story. I built a simple random forest on the same data and simply converting it to PMML took several minutes. Presumably the R package involved is kind of inefficient, but this still is pretty daunting. Manipulating the resulting PMML representation is actually quite difficult. Saving the random forest model ultimately resulted in a 50MB file. Compression reduced that to about 6MB. This is pretty massive for a fairly simple model. On Thu, Mar 5, 2015 at 4:25 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: I think keeping it simple is best; try implementing one or two models in XML and then get fancy if it makes sense. On Wednesday, March 4, 2015, Saikat Kanjilal sxk1...@hotmail.com wrote: Next question: Is the audience for PMML programmers or could it be folks that can script? I'm wondering how this intersects with a simple Spark-like DSL; could Mahout implement an intersection between the two? If there's interest I can go into examples. Sent from my iPhone On Mar 4, 2015, at 4:17 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Sure, those would be options. On Wed, Mar 4, 2015 at 3:41 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Question, is there a way to introduce PMML using a more lightweight format like yaml or json? Date: Wed, 4 Mar 2015 13:25:29 -0800 Subject: Re: PMML From: andrew.mussel...@gmail.com To: dev@mahout.apache.org Yes, the limitations are often an issue for people doing things that aren't in the PMML spec yet; there could be room for suggesting new features in the spec by building them though, I suppose. Also agree that XML is a lousy/bloated way of representing stuff like this, but in the end it's just a choice of representation so there may be reason to use some other encoding and then provide an XML-export function.
On Wed, Mar 4, 2015 at 11:42 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: I am willing to +1 any contribution at this point. My previous company used PMML to serialize simple stuff, but I don't have first-hand experience. Its flexibility is ultimately pretty limited, isn't it? And XML is ultimately a medium which is too ugly and too verbose at the same time to represent models with any more or less decent number of parameters. On Tue, Mar 3, 2015 at 8:19 PM, Suneel Marthi suneel.mar...@gmail.com wrote: It makes sense to support PMML for classification and clustering tasks to be able to share and distribute trained models. Sean, Pat, Dmitriy and Ted please chime in. PMML support in Mahout has been talked about for a long time now but never really got any traction to take off. +1 to
Re: Faster collections for a faster Mahout
What is the license on fastutil? I seem to remember that it was GPL at one time. On Sat, Jan 17, 2015 at 2:34 PM, Sebastiano Vigna vi...@di.unimi.it wrote: Dear developers, I'm writing to suggest significantly improving Mahout's speed by replacing the current, Colt-based collections with faster collections. These are results from benchmarks at java-performance.info comparing fastutil and Mahout in get operations (Mahout collections were not included in the java-performance.info tests):

tests.maptests.primitive.MahoutMapTest (1) = 2176.118213996
tests.maptests.primitive.FastUtilMapTest (1) = 782.852852799
tests.maptests.primitive.MahoutMapTest (10) = 2630.1235654
tests.maptests.primitive.FastUtilMapTest (10) = 1074.903566002
tests.maptests.primitive.MahoutMapTest (100) = 3969.1322968
tests.maptests.primitive.FastUtilMapTest (100) = 1940.7466792

This is with fastutil 6.6.1, which is comparable in speed to Koloboke or the GS collections (the java-performance.info tests use an older, slower version), and, I believe, faster for the purposes of Mahout. Get operations in Mahout collections are 2-3x slower. I locally modified RandomAccessSparseVector to use fastutil and ran some of the VectorBenchmarks:

0 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Create (copy) RandSparseVector mean = 12.57us; mean = 64.88us;
32935 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Create (incrementally) RandSparseVector mean = 31.77us; mean = 79.33us;
244212 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Plus RandSparseVector mean = 47.36us; mean = 101.63us;

On the left you can find the fastutil timings, on the right the Mahout timings.
The only case in which I saw a slowdown is for some dense/sparse products:

429433 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Times Rand.fn(Dense) mean = 78us; mean = 52.47us;

but I think this is due to the different way removals are handled: Mahout uses tombstones (and thus slows down all subsequent operations), whereas fastutil does true deletions, which are slightly slower at remove time, but make subsequent operations faster. Also, iteration over a fastutil-based RandomAccessSparseVector is slowed down by having to return non-standard Element instances instead of Map.Entry instances (as fastutil or the JDK would do naturally). If you'd like to benchmark the speed at a high level, the one-file drop-in is included (you'll need to add fastutil 6.6.1 as a dependency to mahout-math). As I said, things can be improved by using a standard Map.Entry (Int2DoubleMap.Entry) instead of Element. But this is a more pervasive change. Ciao, seba PS: One caveat: presently fastutil does not shrink backing arrays, which might not be what you want. It will, however, from the next release.
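The tombstone point above can be illustrated outside Java. The following is a toy Python open-addressing map, not the actual Mahout or fastutil code, showing how a tombstone left behind by removal keeps lengthening later probe sequences, whereas a true deletion would let lookups stop earlier:

```python
# Toy open-addressing hash map with tombstone removal. An illustration
# of the behavior discussed above, NOT code from Mahout or fastutil.
# Simplified: assumes a key is never re-inserted after being removed.

EMPTY, TOMBSTONE = object(), object()

class TombstoneMap:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while True:
            yield i
            i = (i + 1) % len(self.slots)

    def put(self, key, value):
        for i in self._probe(key):
            s = self.slots[i]
            if s is EMPTY or s is TOMBSTONE or s[0] == key:
                self.slots[i] = (key, value)
                return

    def get_with_probes(self, key):
        """Return (value, number of slots inspected)."""
        probes = 0
        for i in self._probe(key):
            s = self.slots[i]
            probes += 1
            if s is EMPTY:
                return None, probes
            if s is not TOMBSTONE and s[0] == key:
                return s[1], probes

    def remove(self, key):
        for i in self._probe(key):
            s = self.slots[i]
            if s is EMPTY:
                return
            if s is not TOMBSTONE and s[0] == key:
                # mark, don't compact: later lookups still scan this slot
                self.slots[i] = TOMBSTONE
                return

m = TombstoneMap()
for k in (0, 8, 16):      # all three collide into slot 0 (capacity 8)
    m.put(k, str(k))
m.remove(8)
_, probes = m.get_with_probes(16)   # the tombstone is still scanned: 3 probes
```

A true-deletion scheme compacts the probe chain at remove time (slightly more work then), so subsequent gets never wade through dead slots, which matches the trade-off Sebastiano describes.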
Re: Questions about Minhash/SimHash methods
I just looked a little bit and have a few questions. First, these appear to be Java implementations for a single machine. How scalable is that? How would it interact with the new math framework? Second, there are a number of style issues like author tags, indentation and such, but what I find most troubling is an almost complete lack of javadoc and a complete lack of comments about the origin of the algorithms being used, or non-trivial comments about what is happening in the code. I see comments on sections like update w. That doesn't say anything that the code doesn't say. Sent from my iPhone On Jan 10, 2015, at 1:45, Andrew Musselman andrew.mussel...@gmail.com wrote: Non-negative matrix factorization would be a good addition; if you can include tests with your pull request it will help. Assuming this is your PR: https://github.com/apache/mahout/pull/70 Looking forward to more. On Jan 9, 2015, at 11:21 PM, 梁明强 mqliang031...@gmail.com wrote: Dear sir, This is Liang Mingqiang, an undergraduate student, highly interested in Recommender Systems and Mahout. I have implemented the Non-negative Matrix Factorization (NMF) and Probabilistic Matrix Factorization (PMF) methods and opened a pull request with my code for further comment. I tested my code on my computer using the movielens dataset and got reasonable results. Do I need to write and submit a test module for my code? Since I need a dataset for my test, can I add some text files in the test package? In addition, Binary Matrix Factorization (BMF) seems very interesting; I want to contribute my BMF code to Mahout in the next step. Last, but not least, Minhash and SimHash are very popular and useful methods in Recommender Systems. But I looked through the source code of Mahout, and there seem to be no Minhash and SimHash methods. Does that mean those methods haven't been contributed, or is it just that I haven't checked the source code carefully? If those two methods have been contributed, is there anyone willing to tell me the path? Thank you!
Looking forward, Liang Mingqiang
Re: Questions about Minhash/SimHash methods
On Sun, Jan 11, 2015 at 6:51 PM, 梁明强 mqliang031...@gmail.com wrote: In addition, what you mean the new math framework here? Mahout has a new math framework written in scala that parallelizes mathematical operations.
Re: kmeans result is different from scikit-learn result with center points provided
Running this gist can be done using the following two lines of R, btw:

library(devtools)
source_url(
  "https://gist.githubusercontent.com/tdunning/e1575ad2043af732c219/raw/444514454a6f3b5fcbbcaa3f8a919b1965e07f16/Clustering%20is%20hard"
)

You should see something like this as output:

SHA-1 hash of file is 2bc9bf7677d6d5b8b7aa1b1d49749574f5bd942e
$fail
[1] 96

$success
[1] 4

counts
 1  2  3  4
 4 71 22  3

On Mon, Jan 5, 2015 at 11:50 PM, Ted Dunning ted.dunn...@gmail.com wrote: Clustering is harder than you appear to think: http://www.imsc.res.in/~meena/papers/kmeans.pdf https://en.wikipedia.org/wiki/K-means_clustering NP-hard problems are typically solved by approximation. K-means is a great example. Only a few, relatively unrealistic, examples have solutions apparent enough to be found reliably by diverse algorithms. For instance, something as easy as Gaussian clusters with sd=1e-3 situated on 10 random corners of a unit hypercube in 10-dimensional space will be clustered differently by many algorithms unless multiple starts are used. For instance, see https://gist.github.com/tdunning/e1575ad2043af732c219 for an R script that demonstrates that R's standard k-means algorithms fail over 95% of the time for this trivial input, occasionally splitting a single cluster into three parts. Restarting multiple times doesn't fix the problem ... it only makes it a bit more tolerable. This example shows how even 90 restarts could fail for this particular problem. On Mon, Jan 5, 2015 at 11:03 PM, Lee S sle...@gmail.com wrote: But the parameters and distance measure are the same. Only difference: Mahout kmeans convergence is based on whether every cluster has converged; scikit-learn is based on a within-cluster sum of squares criterion. 2015-01-06 14:15 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: I don't think that data is sufficiently clusterable to expect a unique solution. Mean squared error would be a better measure of quality.
On Mon, Jan 5, 2015 at 10:07 PM, Lee S sle...@gmail.com wrote: Data is in this link: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data . I converted it to a sequence file with InputDriver. 2015-01-06 14:04 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: What kind of synthetic data did you use? On Mon, Jan 5, 2015 at 8:29 PM, Lee S sle...@gmail.com wrote: Hi, I used the synthetic data to test the kmeans method, and I wrote the code myself to convert center points to sequence files. Then I ran kmeans with the parameters (-i input -o output -c center -x 3 -cd 1 -cl) and compared the dumped clusteredPoints with the result of the scikit-learn kmeans; it's totally different. I'm very confused. Has anybody ever run kmeans with center points provided and compared the result with another ML library?
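Ted's point generalizes beyond R: Lloyd-style k-means only finds a local optimum, so the result depends entirely on the starting centers. The following self-contained Python sketch makes that concrete with toy 2-d data and hand-picked seeds (both invented for illustration; this is neither Mahout's nor scikit-learn's implementation):

```python
# Minimal Lloyd's algorithm. The data and seed choices below are
# invented for illustration only.

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def lloyd(points, centers, iters=20):
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
            clusters[j].append(p)
        # move each center to the mean of its assigned points
        centers = [mean(c) if c else centers[k] for k, c in enumerate(clusters)]
    return centers

# three tight, well-separated clusters
points = [(0, 0), (0.2, 0), (-0.2, 0),      # cluster A
          (0, 10), (0.2, 10), (-0.2, 10),   # cluster B
          (10, 0), (10.2, 0), (9.8, 0)]     # cluster C

# one seed per cluster: recovers the true centers (0,0), (0,10), (10,0)
good = lloyd(points, [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)])

# two seeds fell inside cluster A: A gets split into two centers, while
# B and C are merged into a single bogus center at (5, 5)
bad = lloyd(points, [(-0.2, 0.0), (0.2, 0.0), (5.0, 5.0)])
```

Both runs converge, and neither reports an error; only the starting centers differ. Multiple random restarts reduce the chance of a bad outcome but, as the R gist shows, do not eliminate it.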
Re: kmeans result is different from scikit-learn result with center points provided
I don't think that data is sufficiently clusterable to expect a unique solution. Mean squared error would be a better measure of quality. On Mon, Jan 5, 2015 at 10:07 PM, Lee S sle...@gmail.com wrote: Data is in this link: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data . I converted it to a sequence file with InputDriver. 2015-01-06 14:04 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: What kind of synthetic data did you use? On Mon, Jan 5, 2015 at 8:29 PM, Lee S sle...@gmail.com wrote: Hi, I used the synthetic data to test the kmeans method, and I wrote the code myself to convert center points to sequence files. Then I ran kmeans with the parameters (-i input -o output -c center -x 3 -cd 1 -cl) and compared the dumped clusteredPoints with the result of the scikit-learn kmeans; it's totally different. I'm very confused. Has anybody ever run kmeans with center points provided and compared the result with another ML library?
Re: kmeans result is different from scikit-learn result with center points provided
Clustering is harder than you appear to think: http://www.imsc.res.in/~meena/papers/kmeans.pdf https://en.wikipedia.org/wiki/K-means_clustering NP-hard problems are typically solved by approximation. K-means is a great example. Only a few, relatively unrealistic, examples have solutions apparent enough to be found reliably by diverse algorithms. For instance, something as easy as Gaussian clusters with sd=1e-3 situated on 10 random corners of a unit hypercube in 10-dimensional space will be clustered differently by many algorithms unless multiple starts are used. For instance, see https://gist.github.com/tdunning/e1575ad2043af732c219 for an R script that demonstrates that R's standard k-means algorithms fail over 95% of the time for this trivial input, occasionally splitting a single cluster into three parts. Restarting multiple times doesn't fix the problem ... it only makes it a bit more tolerable. This example shows how even 90 restarts could fail for this particular problem. On Mon, Jan 5, 2015 at 11:03 PM, Lee S sle...@gmail.com wrote: But the parameters and distance measure are the same. Only difference: Mahout kmeans convergence is based on whether every cluster has converged; scikit-learn is based on a within-cluster sum of squares criterion. 2015-01-06 14:15 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: I don't think that data is sufficiently clusterable to expect a unique solution. Mean squared error would be a better measure of quality. On Mon, Jan 5, 2015 at 10:07 PM, Lee S sle...@gmail.com wrote: Data is in this link: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data . I converted it to a sequence file with InputDriver. 2015-01-06 14:04 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: What kind of synthetic data did you use? On Mon, Jan 5, 2015 at 8:29 PM, Lee S sle...@gmail.com wrote: Hi, I used the synthetic data to test the kmeans method, and I wrote the code myself to convert center points to sequence files.
Then I ran kmeans with the parameters (-i input -o output -c center -x 3 -cd 1 -cl) and compared the dumped clusteredPoints with the result of the scikit-learn kmeans; it's totally different. I'm very confused. Has anybody ever run kmeans with center points provided and compared the result with another ML library?
Re: kmeans result is different from scikit-learn result with center points provided
What kind of synthetic data did you use? On Mon, Jan 5, 2015 at 8:29 PM, Lee S sle...@gmail.com wrote: Hi, I used the synthetic data to test the kmeans method, and I wrote the code myself to convert center points to sequence files. Then I ran kmeans with the parameters (-i input -o output -c center -x 3 -cd 1 -cl) and compared the dumped clusteredPoints with the result of the scikit-learn kmeans; it's totally different. I'm very confused. Has anybody ever run kmeans with center points provided and compared the result with another ML library?
[jira] [Assigned] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient
[ https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reassigned MAHOUT-1636: --- Assignee: Ted Dunning Class dependencies for the spark module are put in a job.jar, which is very inefficient --- Key: MAHOUT-1636 URL: https://issues.apache.org/jira/browse/MAHOUT-1636 Project: Mahout Issue Type: Bug Components: spark Affects Versions: 1.0-snapshot Reporter: Pat Ferrel Assignee: Ted Dunning Fix For: 1.0-snapshot Using a maven plugin and an assembly job.xml, a job.jar is created with all dependencies, including transitive ones. This job.jar is in mahout/spark/target and is included in the classpath when a Spark job is run. This allows dependency classes to be found at runtime, but the job.jar includes a great deal of things that are not needed and are duplicates of classes found in the main mrlegacy job.jar. If the job.jar is removed, drivers will not find needed classes. A better way needs to be implemented for including class dependencies. I'm not sure what that better way is, so I am leaving the assembly alone for now. Whoever picks up this Jira will have to remove it after deciding on a better method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient
[ https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258393#comment-14258393 ] Ted Dunning commented on MAHOUT-1636: - The MIT license is one of the most liberal licenses around and is completely compatible with Apache as a dependency. You can find more information including a list of the so-called category A (totally OK) licenses and the category X (no way, no how) licenses here: http://www.apache.org/legal/resolved.html#category-a
Re: The next time someone wants to help
Hadoop dependencies are a quagmire. It would be far preferable to rewrite the necessary serialization to avoid Hadoop dependencies entirely. If we are dropping the MR code, why do we need to reference the VectorWritable class at all? Even in the worst case, we could simply recode the binary layer from scratch without the heinous dependencies. On Fri, Dec 12, 2014 at 10:06 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: A bit more detail on what needs to happen here IMO: Likely, hadoop-related things we still need for spark etc., like VectorWritable, need to be factored out into a (new) module, mahout-hadoop or something. An important notion here is that we only want to depend on hadoop-commons, which in theory should be common for both the new and old hadoop MR APIs. We may face the fact that we need hdfs as well there, e.g. perhaps for reading sequence file headers, not sure; but we definitely do not need anything mapreduce. Math still cannot depend on that mahout-hadoop, since math must not depend on anything hadoop; that was the premise since like the beginning. Mahout-math is in-core ops only, a lightweight, self-contained thing. More likely, the spark module (and maybe some others if they use that) will have to depend on hadoop serialization for vectors and matrices directly, i.e. on mahout-hadoop. The mrlegacy stuff of course needs to be completely isolated (nobody else depends on it) and made dependent on mahout-hadoop as well. On Fri, Dec 12, 2014 at 9:38 AM, Pat Ferrel p...@occamsmachete.com wrote: The next time someone wants to get into contributing to Mahout, wouldn’t it be nice to prune dependencies? For instance Spark depends on math-scala, which depends on math—at least ideally, but in reality the dependencies include mr-legacy. If some things were refactored into math we might have a much streamlined dependency tree. Some things in Math also can be replaced with newer Scala libs and so could be moved out to a java-common or something that would not be required by the Scala code.
If people are going to use the V1 version of Mahout it would be nice if the choice didn’t force them to drag along all the legacy code if it isn’t being used.
Re: I would like to contribute to the Mahout library
On Thu, Nov 27, 2014 at 6:11 AM, Ray rtmel...@gmail.com wrote: 1) Sign up to maintain the fpgrowth code, with the thought of adding some alternative to the Hadoop MapReduce portion of the implementation. 2) Is there still interest in a deep autoencoder for time series? Both of these are of interest, the first particularly so since several people have asked about this lately. Having a non-map-reduce version of fp-growth would make it possible to maintain that code going forward.
Re: elementwise operator improvements experiments
Isn't it true that sparse iteration should always be used for m := f iff 1) the matrix argument is sparse AND 2) f(0) == 0? Why the need for syntactic notation at all? This property is much easier to test than commutativity. On Sun, Nov 16, 2014 at 7:42 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Another thing is that the optimizer isn't capable of figuring out all elementwise fusions in an elementwise expression; e.g. it is not seeing commutativity rewrites, such as that A * B * A should optimally be computed as sqr(A) * B (it will do it as two pairwise operators, (A*B)*A). Bummer. To do it truly right, it needs to fuse entire elementwise expressions first and then optimize them separately. Ok, that's probably too much for now. I am quite ok with writing something like -0.5 * (a * a) for now. On Sat, Nov 15, 2014 at 10:14 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: PS actually applying an exponent function in place will require an additional underscore, it looks. It doesn't want to treat a function name as a function type in this context for some reason (although it does not require partial syntax when used in arguments inside parentheses): m := exp _ Scala is quirky this way I guess. On Sat, Nov 15, 2014 at 10:02 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: So I did quick experimentation with elementwise operator improvements: (1) stuff like 1 + exp(M): (1a): this requires generalization in the optimizer for elementwise unary operators. I've added things like a notion of whether operators require non-zero iteration only or not. (1b): added fusion of elementwise operators, i.e. ew(1+, ew(exp, A)) is rewritten as ew(1+exp, A) for performance reasons. It still uses an application of a fold over a functional monoid, but I think it should be a fairly ok performance/DSL trade-off here. To get it even better, we may add functional assignment syntax to distributed operands similar to in-memory types, as described further down.
(1c): a notion that self-elementwise things such as expr1 * expr1 (which is a surprisingly frequent occurrence, e.g. in Torgerson MDS) are rewritten as ew(A, square) etc. So that much works. (Note that this also obsoletes the dedicated scalar/matrix elementwise operators that there currently are.) Good. The problem here is that (of course!) the semantics of the Scala language has problems importing something like exp(Double):Double alongside exp(DRM):DRM, apparently because it doesn't adhere to overloading rules (different results), so in practice even though it is allowed, one import overshadows the other. Which means, for the sake of the DSL, we can't have exp(matrix); we have to name it something else. Unless you see a better solution. So ... elementwise naming options: Matrix: mexp(m), msqrt(m), msignum(m). Vector: vexp(v), vsqrt(v)... DRM: dexp(drm), dsqrt(drm)? Let me know what you think. (2) Another problem is that actually doing something like 1+exp(m) on Matrix or Vector types is pretty impractical since, unlike in R (which can count the number of variables bound to an object), the semantics require creating a clone of m for something like exp(m) to guarantee no side effects on m itself. That is, the expression 1 + exp(m) for Matrix or Vector types causes 2 clone-copies of the original argument. Actually that's why I use in-place syntax for in-memory types quite often, something like 1 +=: (x *= x) instead of the more naturally looking 1 + x * x. But unlike with simple elementwise operators (+=), there's no in-place modification syntax for a function. We could add an additional parameter, something like mexp(m, inPlace=true), but I don't like it too much. What I like much more is functional assignment (we already have assignment to a function (row, col, x) => Double, but we can add elementwise function assignment) so that it really looks like m := exp That is pretty cool. Except there's a problem of optimality of assignment. There are functions here (e.g.
abs, sqrt) that don't require full iteration but rather non-zero iteration only. By default, the notation m := func implies dense iteration. So what I suggest here is to add new syntax to do sparse-iteration functional assignments: m ::= abs I actually like it (a lot) because it is short and because it allows for more complex formulas in the same traversal, e.g. the proverbial R exp(m)+1 in place will look like m := (1 + exp(_)) So not terrible. What it lacks though is automatic determination of whether a composite function needs to apply to all elements vs. non-zeros only for in-memory types (for distributed types the optimizer tracks this automatically). I.e. m := abs is not optimal (because abs doesn't affect 0s) and m ::= (abs(_) + 1) is probably also not what one wants (when we have a composition of dense- and sparse-affecting functions, the result is dense).
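Ted's f(0) == 0 test from earlier in the thread can be sketched outside the Scala DSL. The following is an illustrative Python model of a sparse vector, not Mahout code: the assign operator picks sparse or dense iteration automatically by probing the function at zero, which is exactly the property that would make the m ::= syntax unnecessary:

```python
import math

# Illustrative model of the rule discussed above: when assigning m := f
# elementwise, iterate only the stored non-zero entries iff f(0) == 0;
# otherwise every element (including implicit zeros) must be visited.
# This is NOT Mahout's RandomAccessSparseVector.

class SparseVector:
    def __init__(self, size, entries):
        self.size = size
        self.entries = dict(entries)  # index -> non-zero value

    def assign(self, f):
        if f(0.0) == 0.0:
            # sparse iteration: zeros stay zero, touch stored entries only
            self.entries = {i: f(x) for i, x in self.entries.items()}
        else:
            # dense iteration: implicit zeros become f(0) != 0
            self.entries = {i: f(self.entries.get(i, 0.0))
                            for i in range(self.size)}
        return self

v = SparseVector(1_000_000, {3: 4.0, 10: -2.0})
v.assign(abs)       # abs(0) == 0: visits 2 entries, not a million
w = SparseVector(4, {1: 1.0}).assign(math.exp)   # exp(0) == 1: dense
```

The probe only classifies a single composite function, so it also handles the composed case correctly: assigning lambda x: abs(x) + 1 probes to 1 at zero and falls into the dense branch, which matches the "composition of dense- and sparse-affecting functions is dense" observation above.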
Re: SGD Implementation and Questions for mapBlock like functionality
On Wed, Nov 12, 2014 at 9:53 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Once we start mapping aggregate, there's no reason not to map other engine-specific capabilities, which are vast. At this point the dilemma is, no matter what we do we are losing coherency: if we map it all, then other engines will have trouble supporting all of it. If we don't map it all, then we are forcing a capability reduction compared to what the engine actually can do. It is obvious to me that an all-reduce aggregate will make a lot of sense -- even if it means a math checkpoint. But then where do we stop in mapping those? E.g. do we do fold? cartesian? And what is the true reason we are remapping everything if it is already natively available? etc. etc. For myself, I still haven't figured out a good answer to those. Actually, I disagree with the premise here. There *is* a reason not to map all other engine-specific capabilities. That reason is we don't need them. Yet. So far, we *clearly* need some sort of block aggregate for a host of hog-wild sorts of algorithms. That doesn't imply that we need all kinds of mapping aggregates. It just means that we are clear on one need for now. So let's get this one in and see how far we can go. Also, having one kind of aggregation in the DSL does not restrict anyone from using engine-specific capabilities. It just means that one kind of idiom can be done without engine specificity.
Re: SGD Implementation and Questions for mapBlock like functionality
On Wed, Nov 12, 2014 at 2:08 PM, Gokhan Capan gkhn...@gmail.com wrote: Can we easily integrate t-digest for descriptives once we have block aggregates? This might count one more reason. Presumably. T-digest is already in Mahout as part of the OnlineSummarizer.
Re: Mahout 1.0 features (revisited)
On Thu, Oct 23, 2014 at 3:57 PM, Andrew Palumbo ap@outlook.com wrote: Or I can just commit as is and people can have at the organization. Sounds good to me!
Re: Upgrade to Spark 1.1.0?
On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel p...@occamsmachete.com wrote: The problem is not in building Spark; it is in building Mahout using the correct Spark jars. If you are using CDH and Hadoop 2, the correct jars are in the repos. This should be true for MapR as well.
Re: Upgrade to Spark 1.1.0?
On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel p...@occamsmachete.com wrote: Getting off the dubious Spark 1.0.1 version is turning out to be a bit of work. Does anyone object to upgrading our Spark dependency? I’m not sure if Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean upgrading your Spark cluster. It is going to have to happen sooner or later. Sooner may actually be less total pain.
Re: How to build a recommendation system based on mahout serving millions even billions of users ?
You should move forward to version 0.9. Take a look at more recent methods in this book: https://www.mapr.com/practical-machine-learning On Tue, Oct 14, 2014 at 2:43 AM, 王建国 jordanhao...@gmail.com wrote: Hi, Owen and all: I am a developer from China. I am building a recommendation system based on Mahout version 0.9. Since the user ids and item ids are strings, I need to map them to longs. But I found that there is a long-to-int mapping provided by the function int TasteHadoopUtils.idToIndex(long). Considering there may be millions or even billions of users, I wonder if it is possible to have many longs mapped onto one int? If true, that does do much harm. This is quite confusing. What solution should I choose in this situation? Meanwhile, I read the answer from you as follows. Could you please tell me which data structure indexed by long you use in Myrrix. Thanks in advance. wangjiangwei Question: I have read some code about item-based recommendation in version 0.6, starting from org.apache.mahout.cf.taste.hadoop.item.RecommenderJob. I found that there is a long-to-int mapping provided by the function int TasteHadoopUtils.idToIndex(long). The long-to-int mapping is performed both on userId and itemId. I wonder if it is possible to have two longs mapped onto one int? If that is the case, then we would be likely to merge vectors from different item ids/user ids, right? This is quite confusing. Is it better to provide a RandomAccessSparseVector implemented by OpenLongDoubleHashMap instead of OpenIntDoubleHashMap? Thanks in advance. Wei Feng Answer: That's right. It ought to be uncommon but can happen. For recommenders, it only means that you start to treat two users or two items as the same thing. That doesn't do much harm though. Maybe one user's recs are a little funny. I do think it would have been useful to index by long, but that would have significantly increased memory requirements too.
(In developing Myrrix I have switched to use a data structure indexed by long though, because it becomes more necessary to avoid the mapping.) Sean Owen
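A Python sketch of the pigeonhole problem Sean describes. The folding function below is illustrative only (the real TasteHadoopUtils.idToIndex uses its own hash): any mapping from 2^64 longs down to 2^31 non-negative ints must send some distinct ids to the same index.

```python
def id_to_index(long_id):
    # Illustrative fold of a 64-bit id into a non-negative 31-bit index.
    # (TasteHadoopUtils.idToIndex uses its own hash function; the
    # pigeonhole argument holds for ANY long -> int mapping.)
    return (long_id ^ (long_id >> 32)) & 0x7FFFFFFF

# 2^64 possible ids but only 2^31 indices, so collisions must exist:
a = 5
b = 5 + (1 << 31)
assert a != b
assert id_to_index(a) == id_to_index(b)  # two distinct ids, one index
```

In a recommender, such a collision merges two users (or items) into one, which is mostly harmless at small scale but becomes more likely as the id space fills up toward billions of users.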
Re: The portability of MAHOUT platform to python
It is plausible to port some of the newer Scala stuff to Python. It would take some thought about the right way to do it. The kicker is going to be that a lot of what Mahout does bottoms out in math that is written in Java. How that would work from Python is mysterious to me. On Mon, Oct 13, 2014 at 9:18 PM, Vibhanshu Prasad vibhanshugs...@gmail.com wrote: Hello Everyone, I am a college student who wants to contribute towards the development of the Mahout library. I have been using it for the last year and was mesmerized by its features. I wanted to know if someone is working towards porting this whole platform to Python. If not, is there any possible way I can start doing it, provided that I am not a committer yet. Regards Vibhanshu
Re: https://mahout.apache.org/developers/buildingmahout.html
I believe that the POM treats the particular versions listed as Hadoop 2 and all others as Hadoop 1. Inspection of the top-level POM would provide the most authoritative answer. On Wed, Oct 1, 2014 at 7:08 AM, jay vyas jayunit100.apa...@gmail.com wrote: hi mahout: Can we use any Hadoop version to build Mahout, i.e. 2.4.1? It seems that if you give it a garbage Hadoop version, i.e. (2.3.4.5), it still builds, yet at runtime it is clear that the version built is a 1.x version. thanks! FYI this is in relation to BIGTOP-1470, where we are just getting ready for our 0.8 release, so any feedback would be much appreciated! -- jay vyas
Re: Interested in developing for mahout
Thejas, A good starter task would be to gather the discussions about the new recommendation system in Scala and write up a tutorial for using it. Writing new bindings in the math section requires a bit of advanced knowledge of Scala and an ability to read some subtle code. Probably not the best starting point. On Mon, Sep 29, 2014 at 11:34 AM, thejas prasad thejch...@gmail.com wrote: Hey Ted, It seemed interesting. I was looking at Jira and also Git, and it seemed as though some Scala bindings were already implemented. Am I correct? I wanted to take up a task that is trivial since I am new to Scala and even Mahout. With that said, I would be interested in writing more Matlab bindings. Does that sound okay? -Thejas On Sun, Sep 28, 2014 at 3:15 PM, Aamir Khan 9aamirk...@gmail.com wrote: Hi, I am also new to Apache and Mahout. This thread caught my attention. Can you tell what are the areas where development is required? Is there any work on *Clustering*? Any guidance on how to start and useful links are highly appreciated. Many thanks, On Mon, Sep 29, 2014 at 1:19 AM, Ted Dunning ted.dunn...@gmail.com wrote: Thejas, What were your impressions? Which parts of the system match your background and capabilities? On Sun, Sep 28, 2014 at 11:46 AM, Thejas Prasad thejch...@gmail.com wrote: Hey Suneel, I finished reading the paper. What's next? Sent from my iPhone On Sep 26, 2014, at 7:04 PM, Suneel Marthi smar...@apache.org wrote: See this for a start http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf On Fri, Sep 26, 2014 at 8:02 PM, thejas prasad thejch...@gmail.com wrote: what exactly in the scala math library? On Fri, Sep 26, 2014 at 1:00 AM, Ted Dunning ted.dunn...@gmail.com wrote: Got it! Sorry to be dense. On Thu, Sep 25, 2014 at 4:23 PM, Thejas Prasad thejch...@gmail.com wrote: Sorry I meant to say what is the best way to get started**?
Thanks, Thejas Sent from my iPhone On Sep 25, 2014, at 4:28 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Sep 25, 2014 at 9:35 AM, Thejas Prasad thejch...@gmail.com wrote: what is the best way to get statues Hmmm I am totally confused. You must have meant something here. Regarding your next question, the place to start work is on the scala math library.
Re: Interested in developing for mahout
Aamir, There would be a substantial interest in clustering, especially the adaptation of our existing streaming k-means and standard k-means to the new math system in Scala. Part of doing that would require some extension of the framework to include a reduce operation. On Sun, Sep 28, 2014 at 1:15 PM, Aamir Khan 9aamirk...@gmail.com wrote: Hi, I am also new to Apache and Mahout. This thread caught my attention. Can you tell what are the areas where development is required. Is there any work on *Clustering*? Any guidance on how to start and useful links are highly appreciated. Many thanks, On Mon, Sep 29, 2014 at 1:19 AM, Ted Dunning ted.dunn...@gmail.com wrote: Thejas, What were your impressions? Which parts of the system match your background and capabilities? On Sun, Sep 28, 2014 at 11:46 AM, Thejas Prasad thejch...@gmail.com wrote: Hey suneel, I finished reading the paper. What's next? Sent from my iPhone On Sep 26, 2014, at 7:04 PM, Suneel Marthi smar...@apache.org wrote: See this for a start http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf On Fri, Sep 26, 2014 at 8:02 PM, thejas prasad thejch...@gmail.com wrote: what exactly in the scala math library? On Fri, Sep 26, 2014 at 1:00 AM, Ted Dunning ted.dunn...@gmail.com wrote: Got it! Sorry to be dense. On Thu, Sep 25, 2014 at 4:23 PM, Thejas Prasad thejch...@gmail.com wrote: Sorry I meant to say what is the best way to get started**? Thanks, Thejas Sent from my iPhone On Sep 25, 2014, at 4:28 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Sep 25, 2014 at 9:35 AM, Thejas Prasad thejch...@gmail.com wrote: what is the best way to get statues Hmmm I am totally confused. You must have meant something here. Regarding your next question, the place to start work is on the scala math library.
Re: Interested in developing for mahout
Thejas, What were your impressions? Which parts of the system match your background and capabilities? On Sun, Sep 28, 2014 at 11:46 AM, Thejas Prasad thejch...@gmail.com wrote: Hey suneel, I finished reading the paper. What's next? Sent from my iPhone On Sep 26, 2014, at 7:04 PM, Suneel Marthi smar...@apache.org wrote: See this for a start http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf On Fri, Sep 26, 2014 at 8:02 PM, thejas prasad thejch...@gmail.com wrote: what exactly in the scala math library? On Fri, Sep 26, 2014 at 1:00 AM, Ted Dunning ted.dunn...@gmail.com wrote: Got it! Sorry to be dense. On Thu, Sep 25, 2014 at 4:23 PM, Thejas Prasad thejch...@gmail.com wrote: Sorry I meant to say what is the best way to get started**? Thanks, Thejas Sent from my iPhone On Sep 25, 2014, at 4:28 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Sep 25, 2014 at 9:35 AM, Thejas Prasad thejch...@gmail.com wrote: what is the best way to get statues Hmmm I am totally confused. You must have meant something here. Regarding your next question, the place to start work is on the scala math library.
Re: Interested in developing for mahout
Got it! Sorry to be dense. On Thu, Sep 25, 2014 at 4:23 PM, Thejas Prasad thejch...@gmail.com wrote: Sorry I meant to say what is the best way to get started**? Thanks, Thejas Sent from my iPhone On Sep 25, 2014, at 4:28 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Sep 25, 2014 at 9:35 AM, Thejas Prasad thejch...@gmail.com wrote: what is the best way to get statues Hmmm I am totally confused. You must have meant something here. Regarding your next question, the place to start work is on the scala math library.
Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes
On Wed, Sep 24, 2014 at 11:09 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Aggregate is Colt's thing. Colt (aka Mahout-math) establishes a Java-side concept of different function types which are unfortunately not compatible with Scala literals. Dmitriy, Is this because we have other methods that describe the characteristics of the function? What would be the Scala-friendly idiom? Additional traits?
Re: Interested in developing for mahout
On Thu, Sep 25, 2014 at 9:35 AM, Thejas Prasad thejch...@gmail.com wrote: what is the best way to get statues Hmmm I am totally confused. You must have meant something here. Regarding your next question, the place to start work is on the scala math library.
Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes
Yes. That code is computing the Frobenius norm. I can't answer the context question about Scala calling Java, however. On Wed, Sep 24, 2014 at 9:15 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Shannon/Dmitriy, quick question: I want to calculate the Scala equivalent of the Frobenius norm per this API spec in Python ( http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html). I dug into the mahout-math-scala project and found the following API to calculate the norm: def norm = sqrt(m.aggregate(Functions.PLUS, Functions.SQUARE)) I believe the above is also calculating the Frobenius norm; however, I am curious why we are calling a Java API from Scala. The type of m above is a Java interface called Matrix. I'm guessing the implementation of aggregate is happening in mahout-math-scala somewhere; is that assumption correct? Thanks in advance. From: sxk1...@hotmail.com To: dev@mahout.apache.org Subject: RE: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes Date: Thu, 18 Sep 2014 12:51:36 -0700 Ok great, I'll use the cartesian Spark API call. So I'd still like some thoughts on where the code that calls the cartesian should live in our directory structure. Date: Thu, 18 Sep 2014 15:33:59 -0400 From: squ...@gatech.edu To: dev@mahout.apache.org Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes Saikat, Spark has the cartesian() method that will align all pairs of points; that's the nontrivial part of determining an RBF kernel. After that it's a simple matter of performing the equation that's given on the scikit-learn doc page. However, like you said, it'll also have to be implemented using the Mahout DSL. I can envision that users would like to compute pairwise metrics for a lot more than just RBF kernels (pairwise Euclidean distance, etc), so my guess would be that a DSL implementation of cartesian() is what you're looking for. You can build the other methods on top of that. Correct me if I'm wrong.
Shannon On 9/18/14, 3:28 PM, Saikat Kanjilal wrote: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html I need to implement the above in the Scala world and expose a DSL API to call the computation when computing the affinity matrix. From: ted.dunn...@gmail.com Date: Thu, 18 Sep 2014 10:04:34 -0700 Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes To: dev@mahout.apache.org There are a number of non-traditional linear algebra operations like this that are important to implement. Can you describe what you intend to do so that we can discuss the shape of the API and computation? On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Dmitriy et al, as part of the above JIRA I need to calculate the Gaussian kernel between 2 shapes. I looked through mahout-math-scala and didn't see anything to do this; any objections to me adding some code under scalabindings to do this? Thanks in advance.
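Two illustrative Python sketches of the math in this thread (assuming plain lists of lists as matrices): the quoted one-liner sqrt(m.aggregate(Functions.PLUS, Functions.SQUARE)) squares every element, sums, and takes the square root, i.e. the Frobenius norm; and the pairwise RBF kernel is the cartesian-product computation Shannon describes.

```python
from math import exp, sqrt

def frobenius_norm(m):
    # Same computation as sqrt(m.aggregate(Functions.PLUS, Functions.SQUARE)):
    # square every element, sum, take the square root.
    return sqrt(sum(x * x for row in m for x in row))

def rbf_kernel(X, Y, gamma):
    # Pairwise K[i][j] = exp(-gamma * ||x_i - y_j||^2) over the
    # cartesian product of rows, as in sklearn's rbf_kernel.
    return [[exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
             for y in Y] for x in X]

m = [[1.0, 2.0], [2.0, 4.0]]
assert frobenius_norm(m) == 5.0  # sqrt(1 + 4 + 4 + 16)

X = [[0.0, 0.0], [1.0, 0.0]]
K = rbf_kernel(X, X, gamma=0.5)
assert K[0][0] == 1.0                    # zero distance
assert abs(K[0][1] - exp(-0.5)) < 1e-12  # squared distance is 1
```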
Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes
There are a number of non-traditional linear algebra operations like this that are important to implement. Can you describe what you intend to do so that we can discuss the shape of the API and computation? On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Dmitriy et al, as part of the above JIRA I need to calculate the Gaussian kernel between 2 shapes. I looked through mahout-math-scala and didn't see anything to do this; any objections to me adding some code under scalabindings to do this? Thanks in advance.
Re: rowsimilarity
LLR with text is commonly done (that is where it comes from). The simple approach would be to have sentences be users and words be items. This will result in word-word connections. This doesn't directly give document-document similarities. That could be done by transposing the original data (word is user, document is item) but I don't quite understand how to interpret that. Another approach is simply using term weighting and document normalization and scoring every doc against every other. That comes down to a matrix multiplication which is very similar to the transposed LLR problem, so that may give an interpretation. On Mon, Aug 25, 2014 at 10:15 AM, Pat Ferrel p...@occamsmachete.com wrote: LLR with text or non-interaction data. What do we use for counts? Do we care how many times a token is seen in a doc, for instance, or do we just look to see if it was seen? I assume the latter, which means we need a new numNonZeroElementsPerRow in several places in math-scala, right? All the same questions are going to come up over this as did for numNonZeroElementsPerColumn, so please speak now or I’ll just mirror its implementation. On Aug 25, 2014, at 9:38 AM, Pat Ferrel pat.fer...@gmail.com wrote: Turning itemsimilarity into rowsimilarity is fairly simple but requires altering CooccurrenceAnalysis.cooccurrence to swap the transposes and calculate the LLR values for rows rather than columns. The input will be something like a DRM. Row similarity does something like AA’ with LLR weighting and uses similar downsampling, as I take it from the Hadoop code. Let me know if I’m on the wrong track here.
With the new application ID preserving code the following input could be directly processed (it’s my test case) doc1\tNow is the time for all good people to come to aid of their party doc2\tNow is the time for all good people to come to aid of their country doc3\tNow is the time for all good people to come to aid of their hood doc4\tNow is the time for all good people to come to aid of their friends doc5\tNow is the time for all good people to come to aid of their looser brother doc6\tThe quick brown fox jumped over the lazy dog doc7\tThe quick brown fox jumped over the lazy boy doc8\tThe quick brown fox jumped over the lazy cat doc9\tThe quick brown fox jumped over the lazy wolverine doc10\tThe quick brown fox jumped over the lazy cantelope The output will be something like the following, with or without LLR strengths. doc1\tdoc2 doc3 doc4 doc5 … doc6\tdoc7 doc8 doc9 doc10 ... It would be pretty easy to tack on a text analyzer from Lucene to turn this into a full-function doc similarity job since LLR doesn’t need TF-IDF. One question is: is there any reason to do the cross-similarity in RSJ, so [AB’]? I can’t picture what this would mean so am assuming the answer is no.
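For reference, a Python sketch of the log-likelihood ratio score the similarity computation weights by. This mirrors, from memory, the shape of Mahout's LogLikelihood.logLikelihoodRatio (entropies of a 2x2 contingency table), so treat the exact form as an assumption rather than the canonical implementation.

```python
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy over raw counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # 2x2 table: k11 = both events, k12/k21 = one only, k22 = neither.
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))

# Independent counts -> no association, LLR ~ 0:
assert abs(llr(10, 10, 10, 10)) < 1e-9
# Strong cooccurrence -> large positive score:
assert llr(100, 1, 1, 100) > 100.0
```

Note that the counts are just presence counts, which is why a numNonZeroElementsPerRow primitive (rather than row sums) is the right building block here.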
Re: [jira] [Commented] (MAHOUT-1610) Tests can be made more robust to pass in Java 8
On Thu, Aug 28, 2014 at 6:04 AM, ASF GitHub Bot (JIRA) j...@apache.org wrote: Github user srowen commented on the pull request: https://github.com/apache/mahout/pull/46#issuecomment-53716190 I may still have the commit bit for ASF git, but can't merge the pull request myself. (I also realize I'm not yet sure if there's another step? Will asfbot merge back to ASF git if merged here?) If you do the commit with the GitHub note "closes #xx", then GitHub does the right thing. Your commit does the merge.
[jira] [Commented] (MAHOUT-1610) Tests can be made more robust to pass in Java 8
[ https://issues.apache.org/jira/browse/MAHOUT-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112304#comment-14112304 ] Ted Dunning commented on MAHOUT-1610: - Looks good to me. Tests can be made more robust to pass in Java 8 --- Key: MAHOUT-1610 URL: https://issues.apache.org/jira/browse/MAHOUT-1610 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.9 Environment: Java 1.8.0_11 OS X 10.9.4 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Labels: java8, tests Right now, several tests don't seem to pass when run with Java 8 (at least on Java 8). The failures are benign, and just due to tests looking for too-specific values or expecting things like a certain ordering of hashmaps. The tests can easily be made to pass both Java 8 and Java 6/7 at the same time by either relaxing the tests in a principled way, or accepting either output of two equally valid ones as correct. (There's also one curious compilation failure in Java 8, related to generics. It is fixable by changing to a more explicit declaration that should be equivalent. It should be entirely equivalent at compile time, and of course, at run time. I am not sure it's not just a javac bug, but, might as well work around when it's so easy.) -- This message was sent by Atlassian JIRA (v6.2#6252)
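A Python sketch (the actual fixes in the PR are in Java) of the two relaxations the issue describes: comparing contents rather than hash-iteration order, and accepting either of two equally valid outputs instead of pinning one.

```python
# Brittle: asserting on hash-iteration order, which changed in JDK 8.
result = {"b": 2, "a": 1}
# assert list(result.keys()) == ["b", "a"]   # may pass or fail by runtime

# Robust: compare contents, not iteration order.
assert set(result.keys()) == {"a", "b"}
assert sorted(result.items()) == [("a", 1), ("b", 2)]

# Robust: when two outputs are equally valid (e.g. an arbitrary
# tie-break between equal scores), accept either instead of pinning one.
winner = min([("x", 1.0), ("y", 1.0)], key=lambda p: p[1])[0]
assert winner in ("x", "y")
```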
Re: Features by engine page
On Mon, Aug 25, 2014 at 2:40 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: This work is obviously also interesting in that it establishes a probabilistic framework in Mahout (distributions, Gaussian process). We already have that (distributions, not GP). Note that we also have an implementation of recorded-step evolutionary programming that works really well for hyper-parameter search. I don't like the way that the API turned out (too hard to understand).