I think this is more suited to the user mailing list than the dev one, but this
error almost always means you need to repartition your data into smaller
partitions, as one of the partitions is over 2 GB.
When you create your dataset, append something like .repartition(1000) to the
command.
Hi all,
I am trying to create and train a model for a Kaggle competition dataset
using Apache Spark. The dataset has more than 10 million rows.
But when training the model, I get an exception: "*Size exceeds
Integer.MAX_VALUE*".
I found that the same question has been raised on Stack Overflow