Well, two things: 1) wholeTextFiles reads each file as a single (path, content) record, so a file is never split across tasks; 2) your input is .gz, which is not a splittable format anyway.
In other words, you get at most one task per file, so with 6 files you will never see more than 6 tasks running in parallel.
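
If you want more parallelism for the per-line work, one option is to explode the files into lines and repartition before the heavy processing. Rough sketch only, untested against your data; the partition count and the per-line function are placeholders you would supply:

    # wholeTextFiles still gives one (path, content) record per file,
    # but flatMap + repartition spreads the downstream work over more tasks.
    def process_line(line):
        # placeholder: your real per-line parsing would go here
        return line

    rdd_files = sc.wholeTextFiles("/mnt/temp")
    rdd_lines = rdd_files.flatMap(lambda kv: kv[1].split("\n"))
    rdd_lines = rdd_lines.repartition(48)   # e.g. roughly 2-3x your total executor cores
    result = rdd_lines.map(process_line)

You would need to rework the WARC-Record-ID grouping to fit a per-line model, but the reading side no longer bottlenecks on one task per file, and no single task has to build an entire file's worth of Row objects at once, which may be part of what is pushing your containers past the YARN memory limit.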

> On 14 Feb 2017, at 09:36, Henry Tremblay <paulhtremb...@gmail.com> wrote:
> 
> When I use wholeTextFiles, Spark does not run in parallel, and YARN runs out
> of memory.
> I have documented the steps below. First I copy 6 s3 files to hdfs. Then I 
> create an rdd by:
> 
> 
> sc.wholeTextFiles("/mnt/temp")
> 
> 
> Then I process the files line by line using a simple function. When I look at 
> my nodes, I see only one executor is running. (I assume the other is the name 
> node?) I then get an error message that yarn has run out of memory.
> 
> 
> Steps below:
> 
> ========================
> 
> [hadoop@ip-172-31-40-213 mnt]$ hadoop fs -ls /mnt/temp
> Found 6 items
> -rw-r--r--   3 hadoop hadoop    3684566 2017-02-14 07:58 
> /mnt/temp/CC-MAIN-20170116095122-00570-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3486510 2017-02-14 08:01 
> /mnt/temp/CC-MAIN-20170116095122-00571-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3498649 2017-02-14 08:05 
> /mnt/temp/CC-MAIN-20170116095122-00572-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    4007644 2017-02-14 08:06 
> /mnt/temp/CC-MAIN-20170116095122-00573-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3990553 2017-02-14 08:07 
> /mnt/temp/CC-MAIN-20170116095122-00574-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3689213 2017-02-14 07:54 
> /mnt/temp/CC-MAIN-20170116095122-00575-ip-10-171-10-70.ec2.internal.warc.gz
> 
> 
> In [6]: rdd1 = sc.wholeTextFiles("/mnt/temp")
> In [7]: rdd1.count()
> Out[7]: 6
> 
> from pyspark.sql import Row
> 
> def process_file(s):
>     text = s[1]
>     d = {}
>     l =  text.split("\n")
>     final = []
>     the_id = "init"
>     for line in l:
>         if line[0:15] == 'WARC-Record-ID:':
>             the_id = line[15:]
>         d[the_id] = line
>         final.append(Row(**d))
>     return final
> 
> 
> In [8]: rdd2 = rdd1.map(process_file)
> In [9]: rdd2.take(1)
> 
> 
> 17/02/14 08:25:25 ERROR YarnScheduler: Lost executor 2 on 
> ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for 
> exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:25:25 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:25:25 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, 
> ip-172-31-35-32.us-west-2.compute.internal, executor 2): ExecutorLostFailure 
> (executor 2 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:29:34 ERROR YarnScheduler: Lost executor 3 on 
> ip-172-31-45-106.us-west-2.compute.internal: Container killed by YARN for 
> exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:29:34 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 4, 
> ip-172-31-45-106.us-west-2.compute.internal, executor 3): ExecutorLostFailure 
> (executor 3 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:29:34 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:33:44 ERROR YarnScheduler: Lost executor 4 on 
> ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for 
> exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:33:44 WARN TaskSetManager: Lost task 0.2 in stage 2.0 (TID 5, 
> ip-172-31-35-32.us-west-2.compute.internal, executor 4): ExecutorLostFailure 
> (executor 4 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:33:44 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 
> -- 
> Henry Tremblay
> Robert Half Technology
