I called the repartition method at line 2 of the code (quoted below), and the error is
no longer reported:
JavaRDD<List<String>> terms = getContent(input).repartition(10);

But I am curious whether this is the correct approach, and I would welcome any
inputs/suggestions on optimizing the code below.
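
For reference, a minimal sketch of one alternative, under the assumption that the skew comes from wholeTextFiles loading everything into very few partitions: a minimum partition count can be requested at load time (it is only a hint to Spark), rather than repartitioning afterwards. The value 10 simply mirrors repartition(10) above and is not a tuned number:

// Sketch only, not the original code: ask for more input partitions up front
// (minPartitions is a hint), so getContent() runs as smaller tasks without a
// separate repartition and the shuffle it costs.
JavaPairRDD<String, String> input = sc.wholeTextFiles(webFilesPath, 10);
JavaRDD<List<String>> terms = getContent(input);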



On Fri, Nov 4, 2016 at 11:13 AM, Reth RM <reth.ik...@gmail.com> wrote:

> Hi,
>
> Can you please guide me through parallelizing the task of extracting
> webpage text, converting the text to doc vectors, and finally applying k-means?
> I get a "GC overhead limit exceeded at java.util.Arrays.copyOfRange" error at
> task 3 below. Detailed stack trace: https://jpst.it/P33P
>
> Right now the webpage files are 100k. Current approach: 1) I am using the
> wholeTextFiles API to load the 1M webpages, 2) a PairRDD to extract the content
> and convert it to tokens, 4) passing these token lists to be converted into doc
> vectors and finally passing the vectors to k-means, 5) running the job with
> spark-submit, standalone: ./spark-submit --master spark://host:7077
> --executor-memory 4g --driver-memory 4g --class sfsu.spark.main.webmain
> /clustering-1.0-SNAPSHOT.jar
>
> The code snippet is below. I think I should parallelize task 3, or I am doing
> something really wrong; could you please point out my mistakes here?
>
>
> 1. JavaPairRDD<String, String> input = sc.wholeTextFiles(webFilesPath);
>
> 2. JavaRDD<List<String>> terms = getContent(input);
>
> 3. public JavaRDD<List<String>> getContent(JavaPairRDD<String, String> input) {
>     return input.map(new Function<Tuple2<String, String>, List<String>>() {
>         public List<String> call(Tuple2<String, String> tuple) throws Exception {
>             return Arrays.asList(tuple._2().replaceAll("[^A-Za-z']+", " ").trim().toLowerCase().split("\\W+"));
>         }
>     });
> }
>
> 4. JavaRDD<Vector> tfVectors = tf.transform(terms).cache();
>
> 5. KMeansModel model = train(tfVectors, kMeanProperties);
>
>
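
A side note on steps 4 and 5: tf and train(...) are not shown in the snippet. Below is a minimal sketch of how these two steps are typically wired up with Spark MLlib, assuming tf is a HashingTF instance and train simply wraps KMeans.train; the feature count, k, and iteration values are placeholder guesses, not the original kMeanProperties:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;

// Step 4 (assumed): token lists -> term-frequency vectors.
// 1 << 18 features is only an illustrative size.
HashingTF tf = new HashingTF(1 << 18);
JavaRDD<Vector> tfVectors = tf.transform(terms).cache();

// Step 5 (assumed): cluster the TF vectors.
// k = 20 and 50 iterations stand in for whatever kMeanProperties holds.
KMeansModel model = KMeans.train(tfVectors.rdd(), 20, 50);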
