Collect all your RDDs from the single files into a List<RDD<?>>, then call
context.union(context.emptyRDD(), YOUR_LIST). Otherwise, with a larger
number of elements to union, the chained pairwise unions build a deeply
nested lineage and you will get a stack overflow exception.
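
In the Scala code from the original post, an equivalent untested sketch
(process is the helper defined there) would be:

    val rdds = (1 to 28).map { i =>
      val file = if (i < 10) "medline15n000" + i + ".xml"
                 else "medline15n00" + i + ".xml"
      process(file)
    }
    // One union over the whole sequence keeps the lineage flat,
    // unlike 27 nested rdd.union(...) calls.
    val keep = sc.union(rdds).reduceByKey( (x, y) => x + y )
    keep.saveAsTextFile("results")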

On Wed, Sep 16, 2015 at 10:17 PM, Shawn Carroll <shawn.c.carr...@gmail.com>
wrote:

> Your loop is deciding which files to process and then you are unioning the
> data on each iteration. If you change it to load all the files at the same
> time and let Spark sort it out, you should be much faster.
>
> Untested:
>
>  val rdd = sc.textFile("medline15n00*.xml")
>  val words = rdd.flatMap( x => x.split(" ") )
>  // Keep the counts; saving words directly would drop the reduceByKey result.
>  val counts = words.map( x => (x, 1) ).reduceByKey( (x, y) => x + y )
>  counts.saveAsTextFile("results")
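>
> (sc.textFile hands its path string to Hadoop, which accepts glob patterns
> like the one above as well as comma-separated lists of paths, so all 28
> files are loaded as one RDD and Spark parallelizes across their blocks.)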
>
>
>
> shawn.c.carr...@gmail.com
> Software Engineer
> Soccer Referee
>
> On Wed, Sep 16, 2015 at 2:07 PM, huajun <huajun.w...@gmail.com> wrote:
>
>> Hi.
>> I have a problem with this very simple word count program. The program
>> works fine for thousands of similar files in the dataset but is very
>> slow for these first 28 or so. The files are about 50 to 100 MB each,
>> and the program processes other similar sets of 28 files in about 30
>> sec; these first 28, however, take 30 min. The data in these files
>> should not be the problem: if I combine them all into one bigger file,
>> it is processed in about 30 sec.
>>
>> I am running Spark in local mode (with > 100 GB memory), and it just
>> uses 100% CPU (one core) most of the time in this troubled case; no
>> network traffic is involved.
>>
>> Any obvious (or non-obvious) errors?
>>
>>     def process(file: String): RDD[(String, Int)] = {
>>       val rdd = sc.textFile(file)
>>       val words = rdd.flatMap( x => x.split(" ") )
>>       words.map( x => (x, 1) ).reduceByKey( (x, y) => x + y )
>>     }
>>
>>     val file = "medline15n0001.xml"
>>     var keep = process(file)
>>
>>     for (i <- 2 to 28) {
>>       val file = if (i < 10) "medline15n000" + i + ".xml"
>>                  else "medline15n00" + i + ".xml"
>>
>>       val result = process(file)
>>       keep = result.union(keep)
>>     }
>>     keep = keep.reduceByKey( (x, y) => x + y )
>>     keep.saveAsTextFile("results")
>>
>> Thanks.
>>
>>
>>
>>
>
