[MLlib] PCA Aggregator

2018-10-17 Thread Matt Saunders
I built an Aggregator that computes PCA on grouped datasets. I wanted to use the PCA functions provided by MLlib, but they only work on a full dataset, and I needed to run PCA on a grouped dataset (like a RelationalGroupedDataset). So I built a little Aggregator that can do that. Here’s an example

Re: Starting to make changes for Spark 3 -- what can we delete?

2018-10-17 Thread DB Tsai
I'll +1 on removing the legacy mllib code. Many users are confused about the APIs, and some of them have weird behaviors (for example, in gradient descent, the intercept is regularized when it is not supposed to be).
DB Tsai | Siri Open Source Technologies [not a contribution] | Apple, Inc

Re: moving the spark jenkins job builder repo from dbricks --> spark

2018-10-17 Thread shane knapp
On Wed, Oct 17, 2018 at 10:25 AM Yin Huai wrote:
> Shane, Thank you for initiating this work! Can we do an audit of jenkins users and trim down the list?
re pruning external (spark-specific) users w/shell and jenkins login access: we can absolutely do this. limiting logins for EECS

Re: moving the spark jenkins job builder repo from dbricks --> spark

2018-10-17 Thread Yin Huai
Shane, Thank you for initiating this work! Can we do an audit of jenkins users and trim down the list? Also, for packaging jobs, those branch snapshot jobs are active (for example, https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/ for publishing

Re: some doubt on code understanding

2018-10-17 Thread Sandeep Katta
Thanks :) I am wondering how I missed that :)
On Wed, 17 Oct 2018 at 21:58, Sean Owen wrote:
> "/" is integer division, so "x / y * y" is not x, but more like the biggest multiple of y that's <= x.
> On Wed, Oct 17, 2018 at 11:25 AM Sandeep Katta wrote:
> > Hi Guys,
> > I am

Re: some doubt on code understanding

2018-10-17 Thread Sean Owen
"/" is integer division, so "x / y * y" is not x, but more like the biggest multiple of y that's <= x.
On Wed, Oct 17, 2018 at 11:25 AM Sandeep Katta wrote:
> Hi Guys,
> I am trying to understand structured streaming code flow by doing so I came across below code flow
> def
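
The point about integer division can be checked with a small standalone sketch. This is not the Spark source: the object name NextBatchTimeSketch is hypothetical, and intervalMs is passed as a parameter (an assumption made here so the snippet is self-contained) instead of being a field as in the quoted method. It shows that the else branch rounds down to the start of the current interval and then adds one interval, which differs from now + intervalMs whenever now is not already a multiple of intervalMs.

```scala
// Hypothetical standalone sketch of the quoted method, not the Spark source.
// intervalMs is a parameter here only to keep the example self-contained.
object NextBatchTimeSketch {
  def nextBatchTime(now: Long, intervalMs: Long): Long =
    if (intervalMs == 0) now
    // Long division truncates, so now / intervalMs * intervalMs is the
    // largest multiple of intervalMs that is <= now (for non-negative now).
    else now / intervalMs * intervalMs + intervalMs

  def main(args: Array[String]): Unit = {
    val intervalMs = 1000L
    val now = 12345L
    println(nextBatchTime(now, intervalMs)) // 13000: the next interval boundary
    println(now + intervalMs)               // 13345: not aligned to a boundary
  }
}
```

With a 1000 ms interval, a call at 12345 yields 13000, so batches stay aligned to interval boundaries regardless of when the method is called.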

Re: some doubt on code understanding

2018-10-17 Thread Reynold Xin
Rounding.
On Wed, Oct 17, 2018 at 6:25 PM Sandeep Katta <sandeep0102.opensou...@gmail.com> wrote:
> Hi Guys,
> I am trying to understand structured streaming code flow by doing so I came across below code flow
> def nextBatchTime(now: Long): Long = {
>   if (intervalMs == 0) now else now

some doubt on code understanding

2018-10-17 Thread Sandeep Katta
Hi Guys,
I am trying to understand the structured streaming code flow, and in doing so I came across the code below:
def nextBatchTime(now: Long): Long = {
  if (intervalMs == 0) now else now / intervalMs * intervalMs + intervalMs
}
The else part could also have been written as now + intervalMs; is there

Re: Starting to make changes for Spark 3 -- what can we delete?

2018-10-17 Thread Erik Erlandson
My understanding was that the legacy mllib api was frozen, with all new dev going to ML, but it was not going to be removed. Although removing it would get rid of a lot of `OldXxx` shims.
On Wed, Oct 17, 2018 at 12:55 AM Marco Gaido wrote:
> Hi all,
> I think a very big topic on this would

Re: Hadoop 3 support

2018-10-17 Thread Hyukjin Kwon
See the discussion at https://github.com/apache/spark/pull/21588
On Wed, Oct 17, 2018 at 5:06 AM, t4 wrote:
> has anyone got spark jars working with hadoop3.1 that they can share? i am looking to be able to use the latest hadoop-aws fixes from v3.1

Re: Starting to make changes for Spark 3 -- what can we delete?

2018-10-17 Thread Marco Gaido
Hi all,
I think a very big topic on this would be: what do we want to do with the old mllib API? For a long time I have been told that it was going to be removed in 3.0. Is this still the plan?
Thanks, Marco
On Wed, Oct 17, 2018 at 03:11, Marcelo Vanzin wrote:
> Might be good to take