Calling repartition at line 2 of the code quoted below fixed it; the error is no longer reported:

JavaRDD<List<String>> terms = getContent(input).repartition(10);
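A possible alternative (untested on my side) would be to request the partitions up front, since wholeTextFiles also accepts a minPartitions hint; a rough sketch, where the count of 10 is just a guess rather than a tuned value:

// Sketch: ask for more input partitions up front instead of repartitioning later.
// minPartitions is only a hint to Spark, but it avoids the extra shuffle that the
// separate repartition(10) call introduces.
JavaPairRDD<String, String> input = sc.wholeTextFiles(webFilesPath, 10);
JavaRDD<List<String>> terms = getContent(input);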
But I am still curious whether this is the correct approach, and would appreciate any inputs/suggestions for optimizing the code below.

On Fri, Nov 4, 2016 at 11:13 AM, Reth RM <reth.ik...@gmail.com> wrote:
> Hi,
>
> Can you please guide me through parallelizing the task of extracting
> webpage text, converting the text to doc vectors, and finally applying
> k-means? I get a "GC overhead limit exceeded at
> java.util.Arrays.copyOfRange" error at task 3 below. Detailed stack
> trace: https://jpst.it/P33P
>
> Right now the webpage files are 100k. Current approach:
> 1) I am using the wholeTextFiles API to load the 1M webpages.
> 2) I use the PairRDD to extract the content and convert it to tokens.
> 3) I pass this token array to convert it to doc vectors, and finally
>    pass the vectors to KMeans.
> 4) I run the job with spark-submit, standalone:
>    ./spark-submit --master spark://host:7077 --executor-memory 4g
>    --driver-memory 4g --class sfsu.spark.main.webmain
>    /clustering-1.0-SNAPSHOT.jar
>
> Code snippet below. I think I should parallelize task 3, or I am doing
> something really wrong; could you please point me to the mistakes here?
>
> 1. JavaPairRDD<String, String> input = sc.wholeTextFiles(webFilesPath);
>
> 2. JavaRDD<List<String>> terms = getContent(input);
>
> 3. public JavaRDD<List<String>> getContent(JavaPairRDD<String, String> input) {
>        return input.map(new Function<Tuple2<String, String>, List<String>>() {
>            public List<String> call(Tuple2<String, String> tuple) throws Exception {
>                return Arrays.asList(tuple._2().replaceAll("[^A-Za-z']+", " ")
>                        .trim().toLowerCase().split("\\W+"));
>            }
>        });
>    }
>
> 4. JavaRDD<Vector> tfVectors = tf.transform(terms).cache();
>
> 5. KMeansModel model = train(tfVectors, kMeanProperties);
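For reference, a self-contained sketch of the pipeline described above. The class name, feature count, k, and iteration count are placeholders, and the train(vectors, kMeanProperties) helper from step 5 is replaced here by a direct KMeans.train call, so treat this as a sketch rather than the exact code:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;

import scala.Tuple2;

public class WebPageClustering {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("webmain"));
        String webFilesPath = args[0];

        // Steps 1-2: load pages with a partition hint and tokenize the content.
        JavaPairRDD<String, String> input = sc.wholeTextFiles(webFilesPath, 10);
        JavaRDD<List<String>> terms = input.map(
            new Function<Tuple2<String, String>, List<String>>() {
                public List<String> call(Tuple2<String, String> tuple) {
                    return Arrays.asList(tuple._2().replaceAll("[^A-Za-z']+", " ")
                            .trim().toLowerCase().split("\\W+"));
                }
            });

        // Steps 3-4: term-frequency vectors; 1 << 18 features is a placeholder size.
        HashingTF tf = new HashingTF(1 << 18);
        JavaRDD<Vector> tfVectors = tf.transform(terms).cache();

        // Step 5: k = 10 clusters and 20 iterations are placeholder values.
        KMeansModel model = KMeans.train(tfVectors.rdd(), 10, 20);
        System.out.println("Cluster centers: " + model.clusterCenters().length);

        sc.stop();
    }
}

Caching tfVectors before training matters here, since k-means makes repeated passes over the same vectors.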