Classification: Limited

Many thanks for your response Sean.

Question: why is Spark overkill for this, and why is sklearn faster, please?
It's the same algorithm, right?

Thanks again,
Dave Williams

From: Sean Owen <sro...@gmail.com>
Sent: 25 March 2021 16:40
To: Williams, David (Risk Value Stream) <david.willi...@lloydsbanking.com>
Cc: user@spark.apache.org
Subject: Re: FW: Email to Spark Org please


Spark is overkill for this problem; use sklearn.
But I'd suspect that you are using just 1 partition for such a small data set, 
and so get no parallelism from Spark.
Repartition your input into many more partitions, but it's unlikely to get much 
faster than in-core sklearn for this task.
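
The repartitioning suggestion above could be sketched in PySpark roughly as follows. This is a minimal sketch, not a tuned recipe: the helper names `target_partitions` and `repartition_for_training` are hypothetical, and the factor of two partitions per core is an assumed rule of thumb, not something stated in this thread. The only PySpark calls used are `DataFrame.repartition` and `rdd.getNumPartitions`.

```python
def target_partitions(total_cores, per_core=2):
    """Pick a partition count from the executor cores available.

    per_core=2 is an assumed starting point, not a value from this thread.
    """
    return max(1, total_cores * per_core)


def repartition_for_training(df, total_cores):
    """Return df spread over enough partitions to keep every core busy.

    df is expected to be a pyspark.sql.DataFrame. If it already has at
    least the target number of partitions, it is returned unchanged.
    """
    current = df.rdd.getNumPartitions()
    wanted = target_partitions(total_cores)
    if current >= wanted:
        return df
    return df.repartition(wanted)
```

With a training DataFrame called `train_df` (a hypothetical name), `gbt.fit(repartition_for_training(train_df, total_cores=16))` would spread the rows over 32 partitions before fitting. Even so, this is unlikely to beat in-core sklearn on a data set this small.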

On Thu, Mar 25, 2021 at 11:39 AM Williams, David (Risk Value Stream) 
<david.willi...@lloydsbanking.com.invalid> wrote:

Classification: Public

Hi Team,

We are trying to use the Spark ML Gradient Boosted Tree classification algorithm 
and have found that its training performance is very poor.

We would like to see whether we can improve the training time, since it is 
taking 2 days to train on a small dataset.

Our dataset size is 40000. Number of features used for training is 564.

With the same dataset, training in Sklearn (Python) completes in 3 hours, 
but with Spark ML Gradient Boosting it takes 2 days.

We tried increasing the number of executors, executor cores, driver memory, 
etc., but couldn't see any improvement.

The following are the parameters used for training.

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(featuresCol='features', labelCol='bad_flag',
                    predictionCol='prediction', maxDepth=11, maxIter=10000,
                    stepSize=0.01, subsamplingRate=0.5, minInstancesPerNode=110)
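
To put the reported timings next to these parameters, the arithmetic below divides each total training time by `maxIter`. It is only a rough sketch: it assumes both runs built all 10000 trees, which the message does not state for the sklearn run, and the function name `seconds_per_iteration` is made up for illustration. Boosting builds trees one after another, so any fixed cost per iteration is paid `maxIter` times.

```python
SECONDS_PER_HOUR = 3600


def seconds_per_iteration(total_hours, iterations):
    """Average wall-clock seconds spent on each boosting iteration."""
    return total_hours * SECONDS_PER_HOUR / iterations


max_iter = 10000          # maxIter from the GBTClassifier call above
spark_hours = 2 * 24      # "2 days" as reported
sklearn_hours = 3         # "3 hours" as reported

spark_per_tree = seconds_per_iteration(spark_hours, max_iter)
sklearn_per_tree = seconds_per_iteration(sklearn_hours, max_iter)
slowdown = spark_hours / sklearn_hours

print(f"Spark:   {spark_per_tree:.2f} s per tree")
print(f"sklearn: {sklearn_per_tree:.2f} s per tree")
print(f"Spark is about {slowdown:.0f}x slower overall")
```

On those assumptions the Spark run averages roughly 17 seconds per tree against roughly 1 second for sklearn, a 16x gap overall.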

Any suggestions you could offer for tuning this would be really helpful.

Many thanks,
Dave Williams
