Collect all your RDDs from the single files into a List<RDD<?>>, then call context.union(context.emptyRdd(), YOUR_LIST). Otherwise, with a larger number of elements to union you will get a StackOverflowError, because each pairwise union nests the RDD lineage one level deeper.
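A minimal Scala-API sketch of the same idea (assuming `sc` is the active SparkContext; the file names are the hypothetical ones from the thread below, and this needs a Spark runtime to execute):

```scala
// Build one RDD per input file, then union them all in a single call.
// A single sc.union keeps the lineage shallow; a chain of pairwise
// rdd.union(...) calls nests one level per file and can end in a
// StackOverflowError once there are many files.
val files = (1 to 28).map(i => f"medline15n$i%04d.xml") // hypothetical names
val rdds  = files.map(f => sc.textFile(f))
val all   = sc.union(rdds) // one flat union instead of 27 nested ones
```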
On Wed, Sep 16, 2015 at 10:17 PM, Shawn Carroll <shawn.c.carr...@gmail.com> wrote:
> Your loop is deciding the files to process and then you are unioning the
> data on each iteration. If you change it to load all the files at the same
> time and let Spark sort it out, you should be much faster.
>
> Untested:
>
> val rdd = sc.textFile("medline15n00*.xml")
> val words = rdd.flatMap( x => x.split(" ") )
> words.map( x => (x, 1) ).reduceByKey( (x, y) => x + y )
> words.saveAsTextFile("results")
>
> shawn.c.carr...@gmail.com
> Software Engineer
> Soccer Referee
>
> On Wed, Sep 16, 2015 at 2:07 PM, huajun <huajun.w...@gmail.com> wrote:
>
>> Hi.
>> I have a problem with this very simple word count program. The program
>> works fine for thousands of similar files in the dataset but is very slow
>> for these first 28 or so. The files are about 50 to 100 MB each, and the
>> program processes the other similar 28 files in about 30 sec. These first
>> 28 files, however, take 30 min. This should not be a problem with the data
>> in these files, as if I combine all the files into one bigger file, it is
>> processed in about 30 sec.
>>
>> I am running Spark in local mode (with > 100 GB memory) and it just uses
>> 100% CPU (one core) most of the time (for this troubled case); no network
>> traffic is involved.
>>
>> Any obvious (or non-obvious) errors?
>>
>> def process(file : String) : RDD[(String, Int)] = {
>>   val rdd = sc.textFile(file)
>>   val words = rdd.flatMap( x => x.split(" ") )
>>   words.map( x => (x, 1) ).reduceByKey( (x, y) => x + y )
>> }
>>
>> val file = "medline15n0001.xml"
>> var keep = process(file)
>>
>> for (i <- 2 to 28) {
>>   val file = if (i < 10) "medline15n000" + i + ".xml"
>>              else "medline15n00" + i + ".xml"
>>   val result = process(file)
>>   keep = result.union(keep)
>> }
>> keep = keep.reduceByKey( (x, y) => x + y )
>> keep.saveAsTextFile("results")
>>
>> Thanks.
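(One note on the untested snippet above: it saves `words` rather than the reduced counts. A corrected sketch of the glob-based version, assuming the same file layout and an active `sc`, might look like this.)

```scala
// Load all 28 files with one glob so Spark plans a single job,
// then count the words and save the (word, count) pairs.
val lines  = sc.textFile("medline15n00*.xml")
val counts = lines.flatMap(_.split(" "))
                  .map((_, 1))
                  .reduceByKey(_ + _)
counts.saveAsTextFile("results") // save the counts, not the raw words
```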
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/problem-with-a-very-simple-word-count-program-tp24715.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.