Hi, can you please guide me through parallelizing the task of extracting webpage text, converting the text to document vectors, and finally applying k-means? I am getting a "GC overhead limit exceeded at java.util.Arrays.copyOfRange" error in the `getContent` step shown below. Detailed stack trace: https://jpst.it/P33P
Right now there are 100k webpage files. My current approach:

1. Load the 1M webpages with the `wholeTextFiles` API.
2. Use the resulting PairRDD to extract the content and convert it to tokens (`getContent` below).
3. Pass the token lists on to be converted into document vectors.
4. Pass the vectors to k-means.

I run the job with spark-submit against a standalone cluster:

```
./spark-submit --master spark://host:7077 --executor-memory 4g --driver-memory 4g --class sfsu.spark.main.webmain /clustering-1.0-SNAPSHOT.jar
```

I think I should parallelize the `getContent` step, or I am doing something really wrong. Could you please point out my mistakes? Code snippet below:

```java
JavaPairRDD<String, String> input = sc.wholeTextFiles(webFilesPath);

JavaRDD<List<String>> terms = getContent(input);

public JavaRDD<List<String>> getContent(JavaPairRDD<String, String> input) {
    return input.map(new Function<Tuple2<String, String>, List<String>>() {
        public List<String> call(Tuple2<String, String> tuple) throws Exception {
            return Arrays.asList(tuple._2().replaceAll("[^A-Za-z']+", " ")
                                           .trim()
                                           .toLowerCase()
                                           .split("\\W+"));
        }
    });
}

JavaRDD<Vector> tfVectors = tf.transform(terms).cache();

KMeansModel model = train(tfVectors, kMeanProperties);
```
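In case it matters, `tf` is the usual spark.mllib `HashingTF` and `train` is just a thin wrapper around `KMeans.train`. Roughly like this (simplified; the `kMeanProperties` accessor names below are placeholders, not the real ones):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;

// Term-frequency hashing; with no argument it uses the default 2^20 features.
HashingTF tf = new HashingTF();

// Thin wrapper around spark.mllib k-means; kMeanProperties only carries
// k and the iteration count (accessor names are placeholders).
public KMeansModel train(JavaRDD<Vector> vectors, KMeanProperties props) {
    return KMeans.train(vectors.rdd(), props.getK(), props.getMaxIterations());
}
```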
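One idea I have not verified yet: spread the input over more partitions (`wholeTextFiles` seems to create only a few large partitions for many small files) and let the vectors spill to disk instead of using a plain `cache()`. Something like this, reusing `sc`, `webFilesPath`, `tf`, and `terms` from above (the partition count 512 is just an example):

```java
import org.apache.spark.storage.StorageLevel;

// Ask for more input partitions so each task holds less raw text at a time.
JavaPairRDD<String, String> input = sc.wholeTextFiles(webFilesPath, 512);

// ... tokenization as before ...

// Spill serialized vectors to disk when they don't fit in memory,
// instead of caching everything deserialized on-heap.
JavaRDD<Vector> tfVectors = tf.transform(terms)
        .persist(StorageLevel.MEMORY_AND_DISK_SER());
```

Would something along these lines help with the GC overhead, or is the real mistake elsewhere?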