Re: FW: Email to Spark Org please

2021-04-01 Thread Sean Owen
Yes that's a great option when the modeling process itself doesn't really need Spark. You can use any old modeling tool you want and get the parallelism in tuning via hyperopt's Spark integration. On Thu, Apr 1, 2021 at 10:50 AM Williams, David (Risk Value Stream) wrote: > Classification:

RE: FW: Email to Spark Org please

2021-04-01 Thread Williams, David (Risk Value Stream)
Stream) Cc: user@spark.apache.org Subject: Re: FW: Email to Spark Org please -- This email has reached the Bank via an external source -- Right, could also be the case that the overhead of distributing it is just dominating. You wouldn't use sklearn with Spark, just use sklearn at this scale. What

Re: FW: Email to Spark Org please

2021-03-26 Thread Sean Owen
get that working in distributed, will we get > benefits similar to spark ML? > > > > Best Regards, > > Dave Williams > > > > *From:* Sean Owen > *Sent:* 26 March 2021 13:20 > *To:* Williams, David (Risk Value Stream) > > *Cc:* user@spark.apache.org

RE: FW: Email to Spark Org please

2021-03-26 Thread Williams, David (Risk Value Stream)
get that working in distributed, will we get benefits similar to spark ML? Best Regards, Dave Williams From: Sean Owen Sent: 26 March 2021 13:20 To: Williams, David (Risk Value Stream) Cc: user@spark.apache.org Subject: Re: FW: Email to Spark Org please -- This email has reached the Bank via

Re: FW: Email to Spark Org please

2021-03-26 Thread Sean Owen
25 March 2021 16:40 > *To:* Williams, David (Risk Value Stream) < > david.willi...@lloydsbanking.com> > *Cc:* user@spark.apache.org > *Subject:* Re: FW: Email to Spark Org please > > > > > *-- This email has reached the Bank via an external source -- * > > Spark is overkill f

RE: FW: Email to Spark Org please

2021-03-26 Thread Williams, David (Risk Value Stream)
(Risk Value Stream) <david.willi...@lloydsbanking.com> Cc: user@spark.apache.org Subject: Re: FW: Email to Spark Org please -- This email has reached the Bank via an external source -- Spark is overkill for this problem; use sklearn. But I'

Re: FW: Email to Spark Org please

2021-03-25 Thread Sean Owen
Spark is overkill for this problem; use sklearn. But I'd suspect that you are using just 1 partition for such a small data set, and get no parallelism from Spark. Repartition your input to many more partitions, but it's unlikely to get much faster than in-core sklearn for this task. On Thu, Mar
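
A minimal sketch of the repartitioning suggestion above, assuming the training input is a PySpark DataFrame. The helper name ensure_min_partitions and the default of 64 partitions are hypothetical and purely illustrative; the thread does not name a partition count. The helper only relies on df.rdd.getNumPartitions() and df.repartition(n).

```python
def ensure_min_partitions(df, min_partitions=64):
    """Return df spread over at least min_partitions partitions.

    Assumes df is a PySpark DataFrame (or any object exposing
    df.rdd.getNumPartitions() and df.repartition(n)).
    min_partitions=64 is an illustrative default, not a recommendation.
    """
    current = df.rdd.getNumPartitions()
    if current >= min_partitions:
        # Already split widely enough; skip the shuffle that repartition triggers.
        return df
    # Redistribute rows across more partitions so more tasks can run in parallel.
    return df.repartition(min_partitions)
```

One would call this on the training DataFrame before fitting the model. As the reply notes, even with more partitions a data set this small may still train faster in-core with sklearn.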

FW: Email to Spark Org please

2021-03-25 Thread Williams, David (Risk Value Stream)
Classification: Public Hi Team, We are trying to utilize the ML Gradient Boosting Tree Classification algorithm and found the performance of the algorithm is very poor during training. We would like to see whether we can improve the performance timings, since it is taking 2 days for training for a