I think this is better suited to the user mailing list than the dev one, but this 
error almost always means you need to repartition your data into smaller 
partitions: one of your partitions has exceeded 2 GB, and Spark's shuffle blocks 
are capped at Integer.MAX_VALUE bytes.

When you create your dataset, append something like .repartition(1000) to the 
command that creates the initial DataFrame or Dataset.
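For example, a minimal sketch in Scala (the file path, column names, and the partition count of 1000 are all placeholders to tune for your data):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RepartitionExample")
  .getOrCreate()

// Read the training data, then split it into more, smaller partitions
// so that no single partition exceeds the 2 GB block limit.
val df = spark.read
  .option("header", "true")
  .csv("path/to/train.csv")   // placeholder path
  .repartition(1000)          // more partitions => smaller partitions

// Sanity check: the DataFrame should now report 1000 partitions.
println(df.rdd.getNumPartitions)
```

With ~10 million rows, the right partition count depends on the row size; the aim is simply to keep each partition comfortably under 2 GB.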

Ewan

On 15 Aug 2016 17:46, Minudika Malshan <minudika...@gmail.com> wrote:
Hi all,

I am trying to create and train a model for a Kaggle competition dataset using 
Apache Spark. The dataset has more than 10 million rows of data.
But when training the model, I get an exception "Size exceeds 
Integer.MAX_VALUE".

I found the same question has been raised on Stack Overflow, but those answers 
didn't help much.

It would be great if you could help me resolve this issue.

Thanks.
Minudika


