Well, 1) wholeTextFiles is designed to read each whole file as a single record, so you get at most one partition (and hence one task) per file; 2) you use .gz, i.e. gzip is not splittable, so each file can be read by at most one task anyway.
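On 2): a gzip stream has no internal block index, so it can only be decompressed sequentially from byte 0 — which is why Spark cannot split a .gz file across tasks. A quick Spark-free illustration with Python's stdlib gzip module (fabricated sample data, not your WARC files):

```python
import gzip
import zlib

# Build a small gzip blob, shaped like the .warc.gz inputs are stored.
lines = "".join("line %d\n" % i for i in range(1000))
blob = gzip.compress(lines.encode("utf-8"))

# A gzip member starts with the magic bytes 1f 8b and must be
# decompressed sequentially from byte 0 -- there is no block index
# that would let a second reader jump in halfway through the file.
assert blob[:2] == b"\x1f\x8b"

# Decompressing the whole stream works fine...
text = gzip.decompress(blob).decode("utf-8")
assert text == lines

# ...but a slice that starts mid-stream is not a valid gzip member,
# so no task could begin reading at an arbitrary split offset.
try:
    gzip.decompress(blob[len(blob) // 2:])
    mid_stream_readable = True
except (OSError, EOFError, zlib.error):
    mid_stream_readable = False
assert not mid_stream_readable
```

This is why people either store uncompressed (or bzip2/LZO-indexed) files on HDFS, or repartition after reading, when they need parallelism downstream.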
> On 14 Feb 2017, at 09:36, Henry Tremblay <paulhtremb...@gmail.com> wrote:
>
> When I use wholeTextFiles, Spark does not run in parallel, and YARN runs out of memory.
> I have documented the steps below. First I copy 6 S3 files to HDFS. Then I create an RDD by:
>
> sc.wholeTextFiles("/mnt/temp")
>
> Then I process the files line by line using a simple function. When I look at my nodes, I see only one executor is running. (I assume the other is the name node?) I then get an error message that YARN has run out of memory.
>
> Steps below:
>
> ========================
>
> [hadoop@ip-172-31-40-213 mnt]$ hadoop fs -ls /mnt/temp
> Found 6 items
> -rw-r--r--   3 hadoop hadoop 3684566 2017-02-14 07:58 /mnt/temp/CC-MAIN-20170116095122-00570-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop 3486510 2017-02-14 08:01 /mnt/temp/CC-MAIN-20170116095122-00571-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop 3498649 2017-02-14 08:05 /mnt/temp/CC-MAIN-20170116095122-00572-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop 4007644 2017-02-14 08:06 /mnt/temp/CC-MAIN-20170116095122-00573-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop 3990553 2017-02-14 08:07 /mnt/temp/CC-MAIN-20170116095122-00574-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop 3689213 2017-02-14 07:54 /mnt/temp/CC-MAIN-20170116095122-00575-ip-10-171-10-70.ec2.internal.warc.gz
>
> In [6]: rdd1 = sc.wholeTextFiles("mnt/temp")
> In [7]: rdd1.count()
> Out[7]: 6
>
> def process_file(s):
>     text = s[1]
>     d = {}
>     l = text.split("\n")
>     final = []
>     the_id = "init"
>     for line in l:
>         if line[0:15] == 'WARC-Record-ID:':
>             the_id = line[15:]
>         d[the_id] = line
>     final.append(Row(**d))
>     return final
>
> In [8]: rdd2 = rdd1.map(process_file)
> In [9]: rdd2.take(1)
>
> 17/02/14 08:25:25 ERROR YarnScheduler: Lost executor 2 on ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:25:25 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:25:25 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, ip-172-31-35-32.us-west-2.compute.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:29:34 ERROR YarnScheduler: Lost executor 3 on ip-172-31-45-106.us-west-2.compute.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:29:34 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 4, ip-172-31-45-106.us-west-2.compute.internal, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:29:34 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:33:44 ERROR YarnScheduler: Lost executor 4 on ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:33:44 WARN TaskSetManager: Lost task 0.2 in stage 2.0 (TID 5, ip-172-31-35-32.us-west-2.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:33:44 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>
> --
> Henry Tremblay
> Robert Half Technology
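As an aside, the per-file parsing in process_file above can be developed and tested without a cluster. Below is a hypothetical, Spark-free sketch (my own helper name and fabricated sample data, not the code from the quoted post) that takes the (path, text) pair wholeTextFiles yields and returns one (record-id, text) tuple per WARC record — a shape that suits flatMap better than map, since it emits one small element per record instead of one giant Row per file:

```python
def split_warc_records(file_pair):
    """Hypothetical helper: split the text of one WARC file into
    (record_id, record_text) tuples, one per WARC record."""
    _path, text = file_pair
    records, current_id, current_lines = [], None, []
    for line in text.split("\n"):
        if line.startswith("WARC/"):  # a new record header begins
            if current_id is not None:
                records.append((current_id, "\n".join(current_lines)))
            current_id, current_lines = None, [line]
        else:
            current_lines.append(line)
            if line.startswith("WARC-Record-ID:"):
                current_id = line[len("WARC-Record-ID:"):].strip()
    if current_id is not None:  # flush the final record
        records.append((current_id, "\n".join(current_lines)))
    return records

# Tiny fabricated sample, loosely shaped like a WARC file.
sample = ("file.warc",
          "WARC/1.0\nWARC-Record-ID: <urn:uuid:aaa>\nbody one\n"
          "WARC/1.0\nWARC-Record-ID: <urn:uuid:bbb>\nbody two\n")
parsed = split_warc_records(sample)
```

With a helper like this, `rdd1.flatMap(split_warc_records)` would give one RDD element per record, which both parallelizes the downstream work and avoids building one huge in-memory Row per file.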