Could you try to see which phase is causing the hang ? i.e. If you do a count() after flatMap does that work correctly ? My guess is that the hang is somehow related to data not fitting in the R process memory but its hard to say without more diagnostic information.
Thanks Shivaram On Tue, May 26, 2015 at 7:28 AM, Eskilson,Aleksander < alek.eskil...@cerner.com> wrote: > I’ve been attempting to run a SparkR translation of a similar Scala job > that identifies words from a corpus not existing in a newline delimited > dictionary. The R code is: > > dict <- SparkR:::textFile(sc, src1) > corpus <- SparkR:::textFile(sc, src2) > words <- distinct(SparkR:::flatMap(corpus, function(line) { > gsub(“[[:punct:]]”, “”, tolower(strsplit(line, “ |,|-“)[[1]]))})) > found <- subtract(words, dict) > > (where src1, src2 are locations on HDFS) > > Then attempting something like take(found, 10) or saveAsTextFile(found, > dest) should realize the collection, but that stage of the DAG hangs in > Scheduler Delay during the collectPartitions phase. > > Synonymous Scala code however, > val corpus = sc.textFile(src1).flatMap(_.split(“ |,|-“)) > val dict = sc.textFile(src2) > val words = corpus.map(word => > word.filter(Character.isLetter(_))).disctinct() > val found = words.subtract(dict) > > performs as expected. Any thoughts? > > Thanks, > Alek Eskilson > CONFIDENTIALITY NOTICE This message and any included attachments are from > Cerner Corporation and are intended only for the addressee. The information > contained in this message is confidential and may constitute inside or > non-public information under international, federal, or state securities > laws. Unauthorized forwarding, printing, copying, distribution, or use of > such information is strictly prohibited and may be unlawful. If you are not > the addressee, please promptly delete this message and notify the sender of > the delivery error by e-mail or you may call Cerner's corporate offices in > Kansas City, Missouri, U.S.A at (+1) (816)221-1024. >