Hi,

I am using 1 master and 3 slave workers to process 27 GB of Wikipedia data. The data is tab-separated, and every line contains one Wikipedia page: the page title and the page contents. I am using the regular expression to extract links as described on the site below:

http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html#running-pagerank-on-wikipedia
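For reference, here is a minimal sketch of that kind of link extraction on the JVM. The `[[...]]` wiki-link pattern and the `LinkExtractor` class name are my own assumptions for illustration, not the tutorial's exact code; the key point is that the `Pattern` is compiled once and reused, so only short-lived `Matcher` objects are allocated per line:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Compile the pattern once per JVM, not once per line.
    // Assumes MediaWiki-style [[Target]] links, as in the AMP Camp example.
    private static final Pattern LINK = Pattern.compile("\\[\\[(.+?)\\]\\]");

    /** Returns the link targets found in one page body. */
    public static List<String> extractLinks(String pageBody) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(pageBody);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks("See [[Apache Spark]] and [[GraphX]]."));
    }
}
```

In a Spark job this would typically be called from a `map` over the lines of the input; keeping the compiled `Pattern` in a static field (or inside `mapPartitions`) avoids recompiling it for every record.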
Although it runs fine on a data set of around 300 MB, it runs into issues when I try to execute the same code on the 27 GB data set from HDFS. The error thrown is given below:

14/05/05 07:15:22 WARN scheduler.TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.regex.Matcher.<init>(Matcher.java:224)

Is there any way to overcome this issue? My cluster consists of EC2 m3.large machines.

Thanks,
Ajay

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-spark-on-27gb-wikipedia-data-tp6487.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.