Hi,

 Can you please guide me through parallelizing the task of extracting
webpage text, converting the text to doc vectors, and finally applying
k-means? I get a "GC overhead limit exceeded at java.util.Arrays.copyOfRange"
error at task 3 below. Detailed stack trace: https://jpst.it/P33P

Right now the webpage files are 100k. Current approach: 1) I use the
wholeTextFiles API to load the 1M webpages, 2) a PairRDD to extract the
content and convert it to tokens, 3) this is then converted to doc vectors,
and 4) the vectors are finally passed to k-means. I run the job with
spark-submit in standalone mode:

./spark-submit --master spark://host:7077 --executor-memory 4g
--driver-memory 4g --class sfsu.spark.main.webmain
/clustering-1.0-SNAPSHOT.jar

The code snippet is below. I think I should parallelize task 3, or maybe I
am doing something really wrong; could you please point out any mistakes
here?


1. JavaPairRDD<String, String> input = sc.wholeTextFiles(webFilesPath);

2. JavaRDD<List<String>> terms = getContent(input);

3. public JavaRDD<List<String>> getContent(JavaPairRDD<String, String> input) {
    return input.map(new Function<Tuple2<String, String>, List<String>>() {
        public List<String> call(Tuple2<String, String> tuple) throws Exception {
            return Arrays.asList(tuple._2().replaceAll("[^A-Za-z']+", " ")
                    .trim().toLowerCase().split("\\W+"));
        }
    });
}

4. JavaRDD<Vector> tfVectors = tf.transform(terms).cache();

5. KMeansModel model = train(tfVectors, kMeanProperties);
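
In case it helps, here is a rough sketch of the pieces not shown above. It is
a simplified stand-in, not my exact code: it assumes tf is spark.mllib's
HashingTF and train is a small wrapper around KMeans.train driven by a
java.util.Properties object; the bucket count, property names, and default
values are placeholders.

import java.util.Properties;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;

// Term-frequency vectorizer used in step 4; 2^18 feature buckets is a placeholder.
HashingTF tf = new HashingTF(1 << 18);

// Wrapper used in step 5; the "k" / "maxIterations" keys are assumptions
// about what kMeanProperties carries.
public KMeansModel train(JavaRDD<Vector> vectors, Properties kMeanProperties) {
    int k = Integer.parseInt(kMeanProperties.getProperty("k", "10"));
    int maxIterations = Integer.parseInt(kMeanProperties.getProperty("maxIterations", "20"));
    // KMeans.train expects a Scala RDD, hence the .rdd() conversion.
    return KMeans.train(vectors.rdd(), k, maxIterations);
}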
