Re: Design document - MLlib's statistical package for DataFrames

2017-02-18 Thread Holden Karau
r: @nirvanainternat
> > -----Original Message-----
> > From: Tim Hunter [mailto:timhun...@databricks.com]
> > Sent: Friday, February 17, 2017 1:49 PM
> > To: bradc
> > Cc: dev@spark.apache.org
> > Subject: Re: Design document - MLlib's statistical package for DataFrames
> >
> > Hi Brad

RE: Design document - MLlib's statistical package for DataFrames

2017-02-18 Thread Pritish Nawlakhe
-----Original Message-----
From: Tim Hunter [mailto:timhun...@databricks.com]
Sent: Friday, February 17, 2017 1:49 PM
To: bradc
Cc: dev@spark.apache.org
Subject: Re: Design document - MLlib's statistical package for DataFrames

Hi Brad, this task is focusing on moving the existing algorithms

Re: Design document - MLlib's statistical package for DataFrames

2017-02-17 Thread Tim Hunter
Hi Brad, this task is focused on moving the existing algorithms, so that we are not held up by parity issues. Do you have some paper suggestions for cardinality? I do not think there is a feature request on JIRA either.

Tim

On Thu, Feb 16, 2017 at 2:21 PM, bradc wrote:

Re: Design document - MLlib's statistical package for DataFrames

2017-02-16 Thread bradc
Hi, while it is also missing in spark.mllib, I'd suggest adding cardinality as part of the Simple descriptive statistics for both spark.ml and spark.mllib. This is useful even for data in double-precision floating point, to understand the "uniqueness" of the feature data.

Cheers, Brad
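To make the cardinality suggestion concrete, here is a minimal sketch in plain Python of what such a statistic could report for a single feature column. It is not part of the proposal and not an existing MLlib API: the function names `cardinality` and `uniqueness_ratio` are hypothetical, and the choice to treat NaN as a missing value rather than a distinct one is an assumption made for illustration.

```python
import math


def _is_missing(value):
    # Assumption: NaN marks a missing entry and is not counted as a value.
    return isinstance(value, float) and math.isnan(value)


def cardinality(values):
    """Exact number of distinct non-missing values in a feature column."""
    return len({v for v in values if not _is_missing(v)})


def uniqueness_ratio(values):
    """Distinct non-missing values divided by non-missing values.

    A ratio near 1.0 suggests an almost-unique (identifier-like) feature;
    a ratio near 0.0 suggests a feature with few repeated levels.
    """
    present = [v for v in values if not _is_missing(v)]
    if not present:
        return 0.0
    return len(set(present)) / len(present)


if __name__ == "__main__":
    column = [1.0, 2.0, 2.0, 3.5, float("nan")]
    print(cardinality(column))
    print(uniqueness_ratio(column))
```

An exact count keeps every distinct value in memory, so a distributed implementation would more likely rely on an approximate sketch. On the DataFrame side, Spark SQL already exposes `countDistinct` and `approx_count_distinct` as column functions, which cover the same idea outside the statistics package.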

Design document - MLlib's statistical package for DataFrames

2017-02-16 Thread Tim Hunter
Hello all, I have been looking at some of the missing items for complete feature parity between spark.ml and spark.mllib. Here is a proposal for porting mllib.stats, the descriptive statistics package: