Could you try to see which phase is causing the hang ? i.e. If you do a
count() after flatMap does that work correctly ? My guess is that the hang
is somehow related to data not fitting in the R process memory but its hard
to say without more diagnostic information.

Thanks
Shivaram

On Tue, May 26, 2015 at 7:28 AM, Eskilson,Aleksander <
alek.eskil...@cerner.com> wrote:

>  I’ve been attempting to run a SparkR translation of a similar Scala job
> that identifies words from a corpus not existing in a newline delimited
> dictionary. The R code is:
>
>  dict <- SparkR:::textFile(sc, src1)
> corpus <- SparkR:::textFile(sc, src2)
> words <- distinct(SparkR:::flatMap(corpus, function(line) {
> gsub(“[[:punct:]]”, “”, tolower(strsplit(line, “ |,|-“)[[1]]))}))
> found <- subtract(words, dict)
>
>  (where src1, src2 are locations on HDFS)
>
>  Then attempting something like take(found, 10) or saveAsTextFile(found,
> dest) should realize the collection, but that stage of the DAG hangs in
> Scheduler Delay during the collectPartitions phase.
>
>  Synonymous Scala code however,
> val corpus = sc.textFile(src1).flatMap(_.split(“ |,|-“))
> val dict = sc.textFile(src2)
> val words = corpus.map(word =>
> word.filter(Character.isLetter(_))).disctinct()
> val found = words.subtract(dict)
>
>  performs as expected. Any thoughts?
>
>  Thanks,
> Alek Eskilson
>  CONFIDENTIALITY NOTICE This message and any included attachments are from
> Cerner Corporation and are intended only for the addressee. The information
> contained in this message is confidential and may constitute inside or
> non-public information under international, federal, or state securities
> laws. Unauthorized forwarding, printing, copying, distribution, or use of
> such information is strictly prohibited and may be unlawful. If you are not
> the addressee, please promptly delete this message and notify the sender of
> the delivery error by e-mail or you may call Cerner's corporate offices in
> Kansas City, Missouri, U.S.A at (+1) (816)221-1024.
>

Reply via email to