I would also suggest picking a few guinea-pig use cases to validate the design. For example, if I may make a suggestion, let's see how we would vectorize a categorical variable into predictor variables in our would-be language.
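For illustration only, here is a minimal sketch of what vectorizing a categorical variable into predictor variables could look like, using plain one-hot (dummy) coding. It is written in ordinary Python, not in the proposed language, which does not exist yet. The names `fit_levels` and `one_hot` and the sample `colors` column are hypothetical and are not part of any Mahout API.

```python
def fit_levels(values):
    """Collect the distinct category levels in a stable (sorted) order."""
    return sorted(set(values))


def one_hot(values, levels, drop_first=False):
    """Map each categorical value to a row of 0/1 predictor variables.

    With drop_first=True the first level is omitted and acts as the
    reference category (all zeros). Values that are not among the
    known levels also encode as all zeros.
    """
    kept = levels[1:] if drop_first else list(levels)
    index = {level: i for i, level in enumerate(kept)}
    rows = []
    for v in values:
        row = [0.0] * len(kept)
        if v in index:
            row[index[v]] = 1.0
        rows.append(row)
    return kept, rows


# Hypothetical sample column, for demonstration only.
colors = ["red", "green", "blue", "green"]
levels = fit_levels(colors)
columns, matrix = one_hot(colors, levels)
```

A guinea-pig case like this is useful because it forces two passes over the data (collect the levels, then encode), which is the kind of thing an engine-agnostic feature-prep layer would have to express.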

On Wed, Apr 30, 2014 at 11:40 AM, Dmitriy Lyubimov <[email protected]> wrote:

> On Wed, Apr 30, 2014 at 10:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> +1.
>>
>> And the greatest benefit of data frames work is standardization of
>> feature extraction in Mahout, not necessarily any particular algorithms.
>> This has been the thorniest issue in the history and nobody does it well
>> today as it stands.
>
> Correction: nobody does it well in open source and in distributed way,
> that is.
>
>> If we tackle feature prep techniques in engine-agnostic way, this would
>> be truly unique differentiation factor for Mahout.
>>
>> On Wed, Apr 30, 2014 at 7:52 AM, Sebastian Schelter <[email protected]> wrote:
>>
>>> I think you should concentrate on MAHOUT-1490, that is a highly
>>> important task that will be the foundation for a lot of stuff to be built
>>> on top. Let's focus on getting this thing right and then move on to other
>>> things.
>>>
>>> --sebastian
>>>
>>> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
>>>
>>>> Sebastien/Dmitry,In looking through the current list of issues I didnt
>>>> see other algorithms in mahout that are talked about being ported to spark,
>>>> I was wondering if there's any interest/need in porting or writing things
>>>> like LR/KMeans/SVM to use spark, I'd like to help out in this area while
>>>> working on 1490. Also are we planning to port the distributed versions of
>>>> taste to use spark as well at some point.
>>>> Thanks in advance.
