I think this is more suited to the user mailing list than the dev one, but this error almost always means you need to repartition your data into smaller partitions: Spark's shuffle blocks are backed by arrays indexed by Int, so a single partition over 2GB triggers "Size exceeds Integer.MAX_VALUE".
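A minimal sketch of that fix, assuming a Scala job where the training data is read from a CSV file (the file path and the partition count of 1000 are illustrative, not taken from the original question):

```scala
import org.apache.spark.sql.SparkSession

object RepartitionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RepartitionExample")
      .getOrCreate()

    // Hypothetical input path; substitute your own dataset location.
    // repartition(1000) spreads the rows across 1000 partitions so that
    // no single partition grows past the 2GB block limit.
    val df = spark.read
      .option("header", "true")
      .csv("train.csv")
      .repartition(1000)

    println(df.rdd.getNumPartitions)

    spark.stop()
  }
}
```

Pick the partition count so each partition is comfortably under 2GB; for ~10 million rows, a few hundred to a thousand partitions is a reasonable starting point.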
When you create your dataset, put something like .repartition(1000) at the end of the command creating the initial DataFrame or Dataset.

Ewan

On 15 Aug 2016 17:46, Minudika Malshan <minudika...@gmail.com> wrote:

Hi all,

I am trying to create and train a model for a Kaggle competition dataset using Apache Spark. The dataset has more than 10 million rows. When training the model, I get the exception "Size exceeds Integer.MAX_VALUE". I found the same question raised on Stack Overflow, but those answers didn't help much. It would be great if you could help resolve this issue.

Thanks,
Minudika