Thanks, that is very helpful.

Does anyone have experience with Classifying via Clustering?  If so, I'd love
to hear any feedback on that technique!  The basic concept is to run
clustering separately on the data subset for each training label, then
combine the resulting centroids to form your full model.  I got the idea
from here:
https://onlinecourses.science.psu.edu/stat557/book/export/html/63

If this is a Classification technique worth exploring, I'd like to know if
there are any wrapper functions out there to help enable it?  Otherwise, it
would feel pretty monotonous to run clustering on every possible label
value.
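
To make the idea concrete, here is a rough self-contained sketch in
Python/NumPy (illustrative only, not Mahout code; the function names and
the k-per-label parameter are my own invention): cluster each label's
subset on its own, pool all the centroids, then classify a new point by
the label of its nearest centroid.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids for one label's subset."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if (assign == j).any():
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids

def fit(X, y, k_per_label=2):
    """Cluster each label's data separately; the 'model' is just the
    pooled (centroid, label) pairs."""
    cents, labels = [], []
    for label in np.unique(y):
        c = kmeans(X[y == label], k_per_label)
        cents.append(c)
        labels.extend([label] * len(c))
    return np.vstack(cents), np.array(labels)

def predict(model, X):
    """Nearest-centroid classification against the pooled centroids."""
    cents, labels = model
    d = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
    return labels[d.argmin(axis=1)]
```

For example, fitting on two well-separated blobs with labels 0 and 1 and
then calling predict on the same points recovers the original labels.  The
monotony I mentioned is exactly the per-label loop in fit, which is what a
wrapper function would presumably hide.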

Thanks,
          Adam

On Fri, Dec 28, 2012 at 6:54 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Fri, Dec 28, 2012 at 5:30 PM, Adam Baron <adam.j.ba...@gmail.com>
> wrote:
>
> > I'm trying to get familiar with the parallel MapReduce Classification
> > algorithms offered in Mahout. ... Online Passive Aggressive and Hidden
> > Markov Models might be ready to explore as well.
>
>
> I don't think that either of these really got to full production quality in
> Mahout. The HMM, in particular, may have slow convergence on large problems
> which is just where you want the parallel program.
>
> I thought that the Online Passive Aggressive code never made it very far,
> either.
>
>
> > Also, is there a parallel version of Logistic Regression officially in
> > Mahout?
>
>
> Nope.
>
>
> > ... I ask because I
> > came across this parallel Logistic Regression implementation which is
> > apparently based off of Mahout, though not in Mahout:
> > https://github.com/jpatanooga/KnittingBoar/wiki/Code-Development-Notes
> >
>
> Yes.  That is a personal project of Josh Patterson's.  He should comment on
> it.
>
> It appears to be based on parameter averaging [1], which is an OK approach,
> but I think that you can do better.  I would generally recommend an
> alternative with asynchronous parameter updates.  Jeff Dean describes a
> nice implementation in [2].  Josh's work is based on an experimental
> map-reduce+ implementation (where + indicates iterated reduce similar to
> BSP).  The Google learner can be implemented using the standard hack of
> long-lived mappers that simply re-read their inputs repeatedly in an
> asynchronous way.
>
> An alternative BSP implementation can be found in Giraph [3].  All BSP
> implementations tend to use batch synchronous update.
>
> Graphlab [4] uses asynchronous updates.  I don't know the details of what
> they have available.
>
>
> > Also, are there any other parallel MapReduce Classification algorithms in
> > Mahout which I failed to mention worth checking out?
> >
>
> I think you did a good survey.
>
>
> [1] http://www.aclweb.org/anthology-new/N/N10/N10-1069.pdf
> [2] http://techtalks.tv/talks/57639/
> [3] http://incubator.apache.org/giraph/
> [4] http://graphlab.org/
>
