closure issues: wholeTextFiles
Hi,

I am facing closure issues while executing the two programs below. The first follows the "Understanding closures" section of the RDD programming guide (http://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures-):

package spark

import org.apache.spark.sql.SparkSession

object understandClosures extends App {
  var counter = 0
  // The error thrown goes away if we use local[*] as the master
  val sparkSession = SparkSession
    .builder
    .master("spark://Gouravs-iMac:7077")
    //.master("local[*]")
    .appName("test")
    .getOrCreate()

  val valueRDD = sparkSession.sparkContext.parallelize(1 until 100)
  println(valueRDD.count())
  valueRDD.foreach(x => counter += x)
  // But even with local[*] as the master, the total comes out as some
  // arbitrary number such as -1234435435
  println("the value is " + counter.toString())
  sparkSession.close()
}

The second hits the same issue in the foreach over wholeTextFiles:

package spark

import org.apache.spark.sql.SparkSession
// Import this utility for working with URLs. Unlike Java, the semicolon ';' is not required.
import java.net.URL
// Use {...} to import several types from a package on one line, when you don't
// want to import everything and don't want a separate line for each type.
import java.io.{File, BufferedInputStream, BufferedOutputStream, FileOutputStream}

object justenoughTest extends App {
  val sparkSession = SparkSession
    .builder
    .master("spark://Gouravs-iMac:7077")
    //.master("local[*]")
    .appName("test")
    .getOrCreate()

  println("Spark version: " + sparkSession.version)
  println("Spark master: " + sparkSession.sparkContext.master)
  println("Running 'locally'?: " + sparkSession.sparkContext.isLocal)

  val pathSeparator = File.separator
  // The target directory
  val shakespeare = new File("/Users/gouravsengupta/Development/data/shakespeare")

  //val fileContents = sparkSession.read.text("file:///Users/gouravsengupta/Development/data/shakespeare/")
  //val fileContents = sparkSession.read.text(shakespeare.toString)
  val fileContents = sparkSession.sparkContext.wholeTextFiles(shakespeare.toString)
  println(fileContents.count())
  // I am facing the closure issue below
  val testThis = fileContents.foreach(x => "printing value" + x._1)
  sparkSession.close()
}

Can anyone explain why I am facing closure issues while executing this code?

Regards,
Gourav Sengupta
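For what it's worth, the behaviour described above is exactly what the "Understanding closures" section warns about: each executor mutates its own deserialized copy of `counter`, so the driver's copy is never updated. Here is a minimal plain-Python sketch of the mechanism (an analogy, not the Spark API; in real code the fix is an accumulator, e.g. `sparkSession.sparkContext.longAccumulator` in Scala or `sc.accumulator` in pyspark):

```python
import copy

counter = {"value": 0}  # driver-side variable captured by the closure

def run_on_executor(captured_state, partition):
    # Spark serializes the closure and ships it to each executor; the
    # executor therefore works on its own private copy of the captured
    # variable, which we simulate with a deep copy.
    local = copy.deepcopy(captured_state)
    for x in partition:
        local["value"] += x
    return local["value"]  # this update never reaches the driver

partitions = [range(1, 50), range(50, 100)]  # parallelize(1 until 100) split in two
results = [run_on_executor(counter, p) for p in partitions]

print(counter["value"])  # 0 -- the driver's counter was never touched
print(sum(results))      # 4950 -- what an accumulator would have collected
```

This is why `counter` prints 0 (or garbage, if the variable is updated non-atomically in `local[*]` mode): the `+=` happens on the executors' copies, not on the driver's.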
small job runs out of memory using wholeTextFiles
As part of my processing, I have the following code:

rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 10)
rdd.count()

The S3 directory has about 8 GB of data and 61,878 files. I am using Spark 2.1 on EMR, running on 15 m3.xlarge nodes. The job fails with this error:

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 35532 in stage 0.0 failed 4 times, most recent failure: Lost task 35532.3 in stage 0.0 (TID 35543, ip-172-31-36-192.us-west-2.compute.internal, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 7.4 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

I have run it dozens of times, increasing the number of partitions and reducing the size of my data set (the original is 60 GB), but I get the same error each time. In contrast, if I run a simple:

rdd = sc.textFile("s3://paulhtremblay/noaa_tmp/")
rdd.count()

the job finishes in 15 minutes, even with just 3 nodes.

Thanks

--
Paul Henry Tremblay
Robert Half Technology
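The error message's own suggestion is the usual first step here: each whole-file record plus decompression buffers has to fit in the executor's off-heap overhead allowance, which `wholeTextFiles` stresses far more than `textFile`. A configuration sketch (the numbers are illustrative, not a recommendation; `spark.yarn.executor.memoryOverhead` is in MB and in Spark 2.1 defaults to max(384, 10% of executor memory)):

```shell
# Illustrative values only -- tune for your cluster and data.
# "your_job.py" is a placeholder for the actual application.
spark-submit \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  your_job.py
```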
Re: wholeTextfiles not parallel, runs out of memory
Well:

1) the goal of wholeTextFiles is to have each whole file handled by a single executor, and
2) you use .gz, i.e. you will have at most one executor per file.

> On 14 Feb 2017, at 09:36, Henry Tremblay <paulhtremb...@gmail.com> wrote:
>
> When I use wholeTextFiles, Spark does not run in parallel, and YARN runs out of memory.
> I have documented the steps below. First I copy 6 S3 files to HDFS. Then I create an rdd by:
>
> sc.wholeTextFiles("/mnt/temp")
>
> Then I process the files line by line using a simple function. When I look at my nodes, I see only one executor is running. (I assume the other is the name node?) I then get an error message that YARN has run out of memory.
>
> Steps below:
>
> [hadoop@ip-172-31-40-213 mnt]$ hadoop fs -ls /mnt/temp
> Found 6 items
> -rw-r--r--   3 hadoop hadoop    3684566 2017-02-14 07:58 /mnt/temp/CC-MAIN-20170116095122-00570-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3486510 2017-02-14 08:01 /mnt/temp/CC-MAIN-20170116095122-00571-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3498649 2017-02-14 08:05 /mnt/temp/CC-MAIN-20170116095122-00572-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    4007644 2017-02-14 08:06 /mnt/temp/CC-MAIN-20170116095122-00573-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3990553 2017-02-14 08:07 /mnt/temp/CC-MAIN-20170116095122-00574-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3689213 2017-02-14 07:54 /mnt/temp/CC-MAIN-20170116095122-00575-ip-10-171-10-70.ec2.internal.warc.gz
>
> In [6]: rdd1 = sc.wholeTextFiles("/mnt/temp")
> In [7]: rdd1.count()
> Out[7]: 6
>
> def process_file(s):
>     text = s[1]
>     d = {}
>     l = text.split("\n")
>     final = []
>     the_id = "init"
>     for line in l:
>         if line[0:15] == 'WARC-Record-ID:':
>             the_id = line[15:]
>         d[the_id] = line
>         final.append(Row(**d))
>     return final
>
> In [8]: rdd2 = rdd1.map(process_file)
> In [9]: rdd2.take(1)
>
> 17/02/14 08:25:25 ERROR YarnScheduler: Lost executor 2 on ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:25:25 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. [...]
> 17/02/14 08:25:25 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, ip-172-31-35-32.us-west-2.compute.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. [...]
> 17/02/14 08:29:34 ERROR YarnScheduler: Lost executor 3 on ip-172-31-45-106.us-west-2.compute.internal: Container killed by YARN for exceeding memory limits. [...]
> 17/02/14 08:29:34 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 4, ip-172-31-45-106.us-west-2.compute.internal, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. [...]
> 17/02/14 08:29:34 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. [...]
> 17/02/14 08:33:44 ERROR YarnScheduler: Lost executor 4 on ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for exceeding memory limits. [...]
> 17/02/14 08:33:44 WARN TaskSetManager: Lost task 0.2 in stage 2.0 (TID 5, ip-172-31-35-32.us-west-2.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. [...]
> 17/02/14 08:33:44 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. [...]
>
> --
> Henry Tremblay
> Robert Half Technology
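The second point above can be seen with nothing but the standard library: a gzip stream has no internal sync points, so a reader must start at byte 0, which is why Spark cannot split a .gz file across tasks. A small sketch:

```python
import gzip

# Compress 1000 lines into a single gzip blob (a stand-in for one .warc.gz file).
blob = gzip.compress("".join("line %d\n" % i for i in range(1000)).encode())

# Reading from the beginning works: one task can decompress the whole file.
whole = gzip.decompress(blob).decode()
print(whole.count("\n"))  # 1000

# A second "split" starting mid-stream has no gzip header to begin from,
# so decompression fails -- there is no way to hand half the file to
# another task.
try:
    gzip.decompress(blob[len(blob) // 2:])
    print("split worked")
except OSError:
    print("cannot start mid-stream")
```

The unzipped data can, of course, be repartitioned and processed in parallel after the single decompressing task has run.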
Re: wholeTextFiles fails, but textFile succeeds for same path
51,000 files at about 1/2 MB per file. I am wondering if I need this:

http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html

Although, if I am understanding you correctly, even if I copy the S3 files to HDFS on EMR and use wholeTextFiles, I am still only going to be able to use a single executor?

Henry

On 02/11/2017 01:03 PM, Jörn Franke wrote:
> Can you post more information about the number of files, their size and the executor logs?
> A gzipped file is not splittable, i.e. only one executor can gunzip it (the unzipped data can then be processed in parallel).
> wholeTextFiles was designed to be executed on only one executor per file (e.g. for processing XMLs, which are difficult to process in parallel).
> Also, if you have small files (< HDFS block size), they are each processed on one executor by default. You may repartition, though, for parallel processing even in those cases.
>
> On 11 Feb 2017, at 21:40, Paul Tremblay <paulhtremb...@gmail.com> wrote:
>> I've been working on this problem for several days (I am doing more to increase my knowledge of Spark). The code you linked to hangs because after reading in the file, I have to gunzip it.
>> Another way that seems to be working is reading each file in using sc.textFile, then writing it to HDFS, and then using wholeTextFiles on the HDFS result.
>> But the bigger issue is that neither method executes in parallel. When I open my YARN manager, it shows that only one node is being used.
>>
>> Henry
>>
>> On 02/06/2017 03:39 PM, Jon Gregg wrote:
>>> Strange that it's working for some directories but not others. Looks like wholeTextFiles maybe doesn't work with S3? https://issues.apache.org/jira/browse/SPARK-4414
>>> If it's possible to load the data into EMR and run Spark from there, that may be a workaround. This blog post shows a Python workaround that might work as well: http://michaelryanbell.com/processing-whole-files-spark-s3.html
>>>
>>> Jon
>>>
>>> On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com> wrote:
>>>> I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory?
>>>>
>>>> On 02/06/2017 02:35 PM, Paul Tremblay wrote:
>>>>> When I try to create an rdd using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error. I am using pyspark with Spark 2.1.
>>>>> [... code and traceback snipped; see the original message below ...]
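The workaround in the blog post linked above boils down to distributing the list of paths yourself and reading each file inside a map, so the parallelism is the number of paths rather than whatever grouping wholeTextFiles chooses. A local sketch, with plain Python standing in for the RDD operations and temp files standing in for the S3 objects:

```python
import os
import tempfile

# Create a few small files to stand in for the S3 objects.
d = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "c.txt"):
    with open(os.path.join(d, name), "w") as f:
        f.write("contents of " + name)

paths = [os.path.join(d, n) for n in sorted(os.listdir(d))]

def read_one(path):
    # One (filename, contents) pair per file, like wholeTextFiles yields.
    with open(path) as f:
        return (path, f.read())

# With Spark this would be roughly:
#   sc.parallelize(paths, len(paths)).map(read_one)
# i.e. the *paths* are spread across tasks, and each task opens its own file.
pairs = list(map(read_one, paths))
print(len(pairs))  # 3
```

The same shape works against S3 if `read_one` fetches the object with an S3 client instead of `open`; the key point is that partitioning is decided by the path list, not by the input format.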
Re: wholeTextFiles fails, but textFile succeeds for same path
Can you post more information about the number of files, their size and the executor logs?

A gzipped file is not splittable, i.e. only one executor can gunzip it (the unzipped data can then be processed in parallel). wholeTextFiles was designed to be executed on only one executor per file (e.g. for processing XMLs, which are difficult to process in parallel). Also, if you have small files (< HDFS block size), they are each processed on one executor by default. You may repartition, though, for parallel processing even in those cases.

> On 11 Feb 2017, at 21:40, Paul Tremblay <paulhtremb...@gmail.com> wrote:
>
> I've been working on this problem for several days (I am doing more to increase my knowledge of Spark). The code you linked to hangs because after reading in the file, I have to gunzip it.
> Another way that seems to be working is reading each file in using sc.textFile, then writing it to HDFS, and then using wholeTextFiles on the HDFS result.
> But the bigger issue is that neither method executes in parallel. When I open my YARN manager, it shows that only one node is being used.
>
> Henry
>
>> On 02/06/2017 03:39 PM, Jon Gregg wrote:
>> Strange that it's working for some directories but not others. Looks like wholeTextFiles maybe doesn't work with S3? https://issues.apache.org/jira/browse/SPARK-4414
>> If it's possible to load the data into EMR and run Spark from there, that may be a workaround. This blog post shows a Python workaround that might work as well: http://michaelryanbell.com/processing-whole-files-spark-s3.html
>>
>> Jon
>>
>> On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com> wrote:
>>> I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory?
>>>
>>> On 02/06/2017 02:35 PM, Paul Tremblay wrote:
>>>> When I try to create an rdd using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error. I am using pyspark with Spark 2.1.
>>>> [... code and traceback snipped; see the original message below ...]
Re: wholeTextFiles fails, but textFile succeeds for same path
I've been working on this problem for several days (I am doing more to increase my knowledge of Spark). The code you linked to hangs because after reading in the file, I have to gunzip it.

Another way that seems to be working is reading each file in using sc.textFile, then writing it to HDFS, and then using wholeTextFiles on the HDFS result.

But the bigger issue is that neither method executes in parallel. When I open my YARN manager, it shows that only one node is being used.

Henry

On 02/06/2017 03:39 PM, Jon Gregg wrote:
> Strange that it's working for some directories but not others. Looks like wholeTextFiles maybe doesn't work with S3? https://issues.apache.org/jira/browse/SPARK-4414
> If it's possible to load the data into EMR and run Spark from there, that may be a workaround. This blog post shows a Python workaround that might work as well: http://michaelryanbell.com/processing-whole-files-spark-s3.html
>
> Jon
>
> On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com> wrote:
>> I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory?
>>
>> On 02/06/2017 02:35 PM, Paul Tremblay wrote:
>>> When I try to create an rdd using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error. I am using pyspark with Spark 2.1.
>>> [... code and traceback snipped; see the original message below ...]
Re: wholeTextFiles fails, but textFile succeeds for same path
Strange that it's working for some directories but not others. Looks like wholeTextFiles maybe doesn't work with S3? https://issues.apache.org/jira/browse/SPARK-4414

If it's possible to load the data into EMR and run Spark from there, that may be a workaround. This blog post shows a Python workaround that might work as well: http://michaelryanbell.com/processing-whole-files-spark-s3.html

Jon

On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com> wrote:
> I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory?
>
> On 02/06/2017 02:35 PM, Paul Tremblay wrote:
>> When I try to create an rdd using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error.
>>
>> I am using pyspark with Spark 2.1.
>>
>> in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/'
>>
>> rdd = sc.wholeTextFiles(in_path)
>> rdd.take(1)
>> [... traceback snipped; see the original message below ...]
>>
>> rdd = sc.textFile(in_path)
>>
>> In [8]: rdd.take(1)
>> Out[8]: [u'WARC/1.0']

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: wholeTextFiles fails, but textFile succeeds for same path
I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory?

On 02/06/2017 02:35 PM, Paul Tremblay wrote:
> When I try to create an rdd using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error.
>
> I am using pyspark with Spark 2.1.
>
> in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/'
>
> rdd = sc.wholeTextFiles(in_path)
> rdd.take(1)
> [... traceback snipped; see the original message below ...]
>
> rdd = sc.textFile(in_path)
>
> In [8]: rdd.take(1)
> Out[8]: [u'WARC/1.0']
wholeTextFiles fails, but textFile succeeds for same path
When I try to create an rdd using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error.

I am using pyspark with Spark 2.1.

in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/'

rdd = sc.wholeTextFiles(in_path)

rdd.take(1)

/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
   1341
   1342             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1343             res = self.context.runJob(self, takeUpToNumLeft, p)
   1344
   1345             items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
    963         # SparkContext#runJob.
    964         mappedRDD = rdd.mapPartitions(partitionFunc)
--> 965         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
    966         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
    967

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, ip-172-31-45-114.us-west-2.compute.internal, executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Container marked as failed: container_1486415078210_0005_01_16 on host: ip-172-31-45-114.us-west-2.compute.internal. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_1486415078210_0005_01_16
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
        at org.apache.hadoop.util.Shell.run(Shell.java:479)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

rdd = sc.textFile(in_path)

In [8]: rdd.take(1)
Out[8]: [u'WARC/1.0']
Handling Input Error in wholeTextFiles
Hi all,

I have a requirement to process multiple splittable gzip files, and the results need to include each individual file name. I ran into a problem when loading multiple gzip files using the wholeTextFiles method: some files are corrupted, causing an 'unexpected end of input stream' error, and the whole job fails. I tried a flatMap with a try statement, but it still failed. Is there any way to handle this?

Regards,

Khwunchai Jaengsawang
Email: khwuncha...@ku.th
Mobile: +66 88 228 1715
LinkedIn: https://linkedin.com/in/khwunchai | Github: https://github.com/khwunchai
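A try around the transformation fails because the 'unexpected end of input stream' is thrown while Hadoop's record reader is decompressing the file, before any user code runs. One pattern that sidesteps this (a sketch, assuming you can read raw bytes instead, e.g. via sc.binaryFiles, and gunzip them yourself inside a map) is to catch the error per file and filter out the casualties. The per-file gunzip-or-skip logic, in plain Python with local stand-in files:

```python
import gzip
import os
import tempfile

# One good gzip file and one corrupt one, standing in for the real inputs.
d = tempfile.mkdtemp()
good, bad = os.path.join(d, "good.gz"), os.path.join(d, "bad.gz")
with gzip.open(good, "wt") as f:
    f.write("hello\n")
with open(bad, "wb") as f:
    f.write(b"\x1f\x8b broken stream")  # starts like gzip, then garbage

def read_or_none(path):
    # Returns (filename, text), or None if the stream is corrupt,
    # so a single bad file cannot fail the whole job.
    try:
        with gzip.open(path, "rt") as f:
            return (path, f.read())
    except (OSError, EOFError):
        return None

results = [r for r in map(read_or_none, (good, bad)) if r is not None]
print(len(results))  # 1
```

In Spark the same function would sit inside a map over the (filename, bytes) pairs, followed by a filter dropping the None entries; either way the file name travels with its contents, which satisfies the original requirement.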
Re: wholeTextFiles()
Also, in case the issue was not due to the string length (though that limit is still valid and may get you later), it may be due to some other indexing issues which are currently being worked on here: https://issues.apache.org/jira/browse/SPARK-6235 On Mon, Dec 12, 2016 at 8:18 PM, Jakob Odersky wrote: > [Jakob's reply, quoted in full below, trimmed]
Re: wholeTextFiles()
Hi Pradeep, I'm afraid you're running into a hard Java limit. Strings are indexed with signed integers and can therefore not be longer than approximately 2 billion characters. Could you use `textFile` as a workaround? It will give you an RDD of the files' lines instead. In general, this guide http://spark.apache.org/contributing.html gives information on how to contribute to Spark, including instructions on how to file bug reports (which does not apply in this case, as it isn't a bug in Spark). regards, --Jakob On Mon, Dec 12, 2016 at 7:34 PM, Pradeep wrote: > [original question, quoted below, trimmed]
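Jakob's two points can be seen side by side in plain Java (no Spark; the temp-file setup is illustrative): a Java String or char array is indexed by a signed int, so a wholeTextFiles record cannot grow much past ~2 billion characters, whereas a textFile-style line stream never needs one String to hold the whole file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class LinesVsWholeFile {
    // textFile-style: stream the file record by record; memory use is bounded
    // by the longest line, not the file size.
    public static long countLines(Path p) {
        try (Stream<String> lines = Files.lines(p)) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // wholeTextFiles-style: the entire file must fit into a single String,
    // whose length is a signed int, hence the ~2-billion-character ceiling.
    public static String readWhole(Path p) {
        try {
            return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

(The practical ceiling is usually hit even earlier than Integer.MAX_VALUE, because the byte array, the decoded char array, and the String may coexist in memory during decoding.)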
wholeTextFiles()
Hi, Why is there a restriction on the maximum file size that can be read by the wholeTextFiles() method? I can read a 1.5 GB file but get an out-of-memory error for a 2 GB file. Also, how can I raise this as a defect in the Spark JIRA? Can someone please guide me. Thanks, Pradeep
Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?
Well, I have already tried that. You are talking about a command similar to this, right? yarn logs -applicationId application_Number This gives me the processing logs, which contain information about the tasks, RDD blocks, etc. What I really need is the output log that gets generated as part of the Spark job: the job writes its output to a file named in the job itself, and that file currently resides in the appcache. Is there a way I can get it once the job is over? On Wed, Sep 21, 2016 at 4:00 PM, ayan guha <guha.a...@gmail.com> wrote: > On yarn, logs are aggregated from each container to hdfs. You can use > yarn CLI or UI to view them. For Spark, you would have a history server which > consolidates the logs > [earlier quoted messages trimmed] -- Nisha Menon BTech (CS) Sahrdaya CET, MTech (CS) IIIT Banglore.
Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?
On yarn, logs are aggregated from each container to hdfs. You can use yarn CLI or UI to view them. For Spark, you would have a history server which consolidates the logs. On 21 Sep 2016 19:03, "Nisha Menon" <nisha.meno...@gmail.com> wrote: > I looked at the driver logs; that reminded me that I needed to look at the > executor logs. > [rest of quoted thread trimmed]
How does wholeTextFiles() work in Spark-Hadoop Cluster?
I looked at the driver logs; that reminded me that I needed to look at the executor logs. There the issue was that the Spark executors were not getting a configuration file. I broadcasted the file and now the processing happens. Thanks for the suggestion. Currently my issue is that the log file generated independently by the executors goes to the respective containers' appcache, and then it gets lost. Is there a recommended way to get the output files from the individual executors? On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote: > Are you looking at the worker logs or the driver? > [original message, quoted in full, trimmed] -- Nisha Menon BTech (CS) Sahrdaya CET, MTech (CS) IIIT Banglore.
Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?
Are you looking at the worker logs or the driver? On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com> wrote: > [original message, quoted in full, trimmed] -- Thanks, Sonal Nube Technologies <http://www.nubetech.co> <http://in.linkedin.com/in/sonalgoyal>
How does wholeTextFiles() work in Spark-Hadoop Cluster?
I have an RDD created as follows:

JavaPairRDD<String,String> inputDataFiles =
    sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");

On this RDD I perform a map to process individual files, and invoke a foreach to trigger that map:

JavaRDD<Object[]> output = inputDataFiles.map(
    new Function<Tuple2<String,String>, Object[]>() {
        private static final long serialVersionUID = 1L;

        @Override
        public Object[] call(Tuple2<String,String> v1) throws Exception {
            System.out.println("in map!");
            // do something with v1 and return the result
            // (the original posted "return Object[]", which does not compile)
            return new Object[]{v1._1()};
        }
    });

output.foreach(new VoidFunction<Object[]>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(Object[] t) throws Exception {
        // do nothing!
        System.out.println("in foreach!");
    }
});

This code works perfectly fine in a standalone setup on my local laptop, accessing both local files and remote HDFS files.

In the cluster, the same code produces no results. My intuition is that the data has not reached the individual executors, and hence both the `map` and the `foreach` do not work. It is only a guess, but I am not able to figure out why this would not work in the cluster. I don't even see the print statements in `map` and `foreach` getting printed in cluster-mode execution.

I notice a particular line in the standalone output that I do NOT see in the cluster execution:

16/09/07 17:35:35 INFO WholeTextFileRDD: Input split: Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345

I had similar code with textFile() that worked earlier for individual files on the cluster. The issue is with wholeTextFiles() only.

Please advise on the best way to get this working, or on alternate ways.

My setup is the Cloudera 5.7 distribution with the Spark service. I used `yarn-client` as the master.

The action can be anything; it's just a dummy step to invoke the map. I also tried System.out.println("Count is:" + output.count()); and got the correct answer of 10, since there were 10 files in the folder, but the map still refuses to work.

Thanks.
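Two Spark facts explain the "silent" map above. First, map is a lazy transformation: it runs only when an action forces it (which count() did, so the map *did* run). Second, when it runs in a cluster, System.out.println executes on the executors, so the output lands in the executors' stdout logs (viewable in the YARN/Spark UI), not on the driver console. The laziness half has an exact analogue in java.util.stream (an analogy only, not Spark API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyMapDemo {
    // Returns {mapper invocations before the terminal op, invocations after}.
    public static int[] run() {
        AtomicInteger mapCalls = new AtomicInteger();
        Stream<String> mapped = Arrays.asList("a.txt", "b.txt").stream()
                .map(f -> { mapCalls.incrementAndGet(); return f.toUpperCase(); });
        int before = mapCalls.get(); // 0: map() only described work, like an RDD transformation
        List<String> out = mapped.collect(Collectors.toList()); // terminal op, like a Spark action
        return new int[]{before, mapCalls.get()};
    }
}
```

So "no prints on the driver" does not mean "the map never ran"; check the per-executor stdout logs before concluding that the data never reached the executors.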
Re: Issue with wholeTextFiles
Can you paste the exception stack here? Thanks Best Regards On Mon, Mar 21, 2016 at 1:42 PM, Sarath Chandra <sarathchandra.jos...@algofusiontech.com> wrote: > [original message, quoted in full below, trimmed]
Issue with wholeTextFiles
I'm using Hadoop 1.0.4 and Spark 1.2.0. I'm facing a strange issue. I have a requirement to read a small file from HDFS, and all its content has to be read in one shot. So I'm using the Spark context's wholeTextFiles API, passing the HDFS URL for the file. When I try this from a spark shell it works as mentioned in the documentation, but when I try the same through a program (by submitting the job to the cluster) I get a FileNotFoundException. I have all compatible JARs in place. Please help.
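A frequent cause of exactly this shell-works/submit-fails pattern (and of the RawLocalFileSystem traces later in this digest): the path carries no scheme, so Hadoop resolves it against whatever default filesystem the submitted job's configuration happens to have, often the *local* filesystem instead of HDFS. Passing a fully qualified hdfs:// URI removes the ambiguity. The scheme dispatch can be sketched with java.net.URI (hypothetical host and port):

```java
import java.net.URI;

public class SchemeCheck {
    // Returns the scheme Hadoop's FileSystem.get() would dispatch on,
    // or null when the path is schemeless and falls back to the
    // configured default filesystem.
    public static String schemeOf(String path) {
        return URI.create(path).getScheme();
    }
}
```

So prefer sc.wholeTextFiles("hdfs://namenode:8020/user/me/small.txt") over a bare "/user/me/small.txt" when the job's fs.default.name is in doubt.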
wholeTextFiles(/x/*/*.txt) runs single threaded
Hi, I have a cluster of 4 machines and I call sc.wholeTextFiles("/x/*/*.txt"). Folder x contains subfolders, and each subfolder contains thousands of files, with a total of ~1 million matching the path expression. My Spark task starts processing the files, but single-threaded: I can see in the Spark UI that only 1 executor is used out of 4, and only 1 thread out of the configured 24.

spark-submit --class com.stratified.articleids.NxmlExtractorJob \
  --driver-memory 8g \
  --executor-memory 8g \
  --num-executors 4 \
  --executor-cores 16 \
  --master yarn-cluster \
  --conf spark.akka.frameSize=128 \
  $JAR

My actual code is:

val rdd = extractIds(sc.wholeTextFiles(xmlDir))
rdd.saveAsObjectFile(serDir)

Is saveAsObjectFile causing this, and are there any workarounds? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: wholeTextFiles(/x/*/*.txt) runs single threaded
In the Spark UI I can see it creating 2 stages. I tried wholeTextFiles().repartition(32), but got the same threading results.
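repartition(32) not helping is consistent with the cause: wholeTextFiles uses a combining input format that packs many small files into few splits, so the *read stage* can end up as a single task; repartition only adds a later stage. The knob to try is the minPartitions argument of wholeTextFiles itself, e.g. sc.wholeTextFiles(xmlDir, 64), which caps the combined split size at roughly totalBytes / minPartitions. A toy model of that packing in plain Java (a sketch of the idea, not Spark's actual code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombineSplitSketch {
    // Mimics, in spirit, how wholeTextFiles turns minPartitions into a cap
    // on combined split size (total / minPartitions), then greedily packs files.
    public static List<List<Long>> pack(long[] fileSizes, int minPartitions) {
        long total = Arrays.stream(fileSizes).sum();
        long maxSplit = Math.max(1, total / minPartitions);
        List<List<Long>> splits = new ArrayList<>();
        List<Long> cur = new ArrayList<>();
        long curSize = 0;
        for (long s : fileSizes) {
            if (!cur.isEmpty() && curSize + s > maxSplit) {
                splits.add(cur);          // close the current split
                cur = new ArrayList<>();
                curSize = 0;
            }
            cur.add(s);
            curSize += s;
        }
        if (!cur.isEmpty()) splits.add(cur);
        return splits;
    }
}
```

With the default minPartitions, a million small files can collapse into one split (one task, one thread); raising minPartitions shrinks the cap and yields more read tasks.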
wholeTextFiles on 20 nodes
I have 20 nodes via EC2 and an application that reads the data via wholeTextFiles. I've tried to copy the data into Hadoop via copyFromLocal, and I get

14/11/24 02:00:07 INFO hdfs.DFSClient: Exception in createBlockOutputStream 172.31.2.209:50010 java.io.IOException: Bad connect ack with firstBadLink as X:50010
14/11/24 02:00:07 INFO hdfs.DFSClient: Abandoning block blk_-8725559184260876712_2627
14/11/24 02:00:07 INFO hdfs.DFSClient: Excluding datanode X:50010

a lot. Then I went the file-system route via copy-dir, which worked well. Now everything is under /root/txt on all nodes. I submitted the job with the file:///root/txt/ directory for wholeTextFiles() and I get

Exception in thread "main" java.io.FileNotFoundException: File does not exist: /root/txt/3521.txt

The file exists on the root node and should be everywhere according to copy-dir. The Hadoop variant worked fine with 3 nodes, but it starts misbehaving with 20. I added

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>

to hdfs-site.xml and core-site.xml; it didn't help.
Re: wholeTextFiles not working with HDFS
I had the same issue with spark-1.0.2-bin-hadoop1, and indeed the issue seems related to Hadoop 1. When switching to spark-1.0.2-bin-hadoop2, the issue disappears.
Re: wholeTextFiles not working with HDFS
I have the same issue.

val a = sc.textFile("s3n://MyBucket/MyFolder/*.tif")

a.first works perfectly fine, but

val d = sc.wholeTextFiles("s3n://MyBucket/MyFolder/*.tif")

does not work: d.first gives the following error message:

java.io.FileNotFoundException: File /MyBucket/MyFolder.tif does not exist.
Re: wholeTextFiles not working with HDFS
That worked for me as well. I was using Spark 1.0 compiled against Hadoop 1.0; I switched to 1.0.1 compiled against Hadoop 2.
Re: wholeTextFiles like for binary files ?
You cannot read image files with wholeTextFiles because it uses CombineFileInputFormat, which cannot read gzipped files because they are not splittable http://www.bigdataspeak.com/2013_01_01_archive.html (source proving it):

override def createRecordReader(
    split: InputSplit,
    context: TaskAttemptContext): RecordReader[String, String] = {
  new CombineFileRecordReader[String, String](
    split.asInstanceOf[CombineFileSplit],
    context,
    classOf[WholeTextFileRecordReader])
}

You may be able to use newAPIHadoopFile with WholeFileInputFormat https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java (not built into Hadoop, but all over the internet) to get this to work correctly. I don't think WholeFileInputFormat will work as-is, though, since it just gets the bytes of the file, meaning you may have to write your own class, possibly extending WholeFileInputFormat. Thanks Best Regards On Thu, Jun 26, 2014 at 3:31 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: > Is there an equivalent of wholeTextFiles for binary files, for example a set of images? > Cheers, Jaonary
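Whatever the Hadoop plumbing, the WholeFileInputFormat approach boils down to: one input file becomes one record whose value is the file's raw bytes, keyed by its path. That per-record read is just this in plain Java (illustrative names; the temp-file path in the test is hypothetical):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class WholeBinaryFile {
    // (filename, raw bytes): the binary analogue of wholeTextFiles'
    // (path, contents) pair, suitable for images or any other binary payload.
    public static Map.Entry<String, byte[]> readRecord(Path p) {
        try {
            return new SimpleEntry<>(p.toString(), Files.readAllBytes(p));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Worth checking your Spark version, too: later releases (Spark 1.2+) added sc.binaryFiles(path), which returns (path, PortableDataStream) pairs and covers this use case without a custom input format.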
wholeTextFiles and gzip
Interesting question on Stack Overflow: http://stackoverflow.com/questions/24402737/how-to-read-gz-files-in-spark-using-wholetextfiles Is it possible to read gzipped files using wholeTextFiles()? Alternately, is it possible to read the source file names using textFile()?
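On the second question: textFile's records are bare lines with no filename attached, which is exactly the gap wholeTextFiles fills. Its contract, a map from file path to full file content, can be sketched for a local directory in plain Java (an analogy for illustration, not Spark API):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WholeTextFilesSketch {
    // Local analogue of sc.wholeTextFiles(dir): path -> full file contents.
    public static Map<String, String> readAll(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(Files::isRegularFile)
                    .collect(Collectors.toMap(Path::toString, p -> {
                        try {
                            return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In Spark, keeping the filename with textFile-style reading requires either wholeTextFiles itself or a custom input format; plain textFile cannot recover the source path per record.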
wholeTextFiles like for binary files ?
Is there an equivalent of wholeTextFiles for binary files for example a set of images ? Cheers, Jaonary
Re: wholeTextFiles not working with HDFS
I didn't fix the issue so much as work around it. I was running my cluster locally, so using HDFS was just a preference. The code worked with the local file system, so that's what I'm using until I can get some help.
Re: wholeTextFiles not working with HDFS
Hi Sguj and littlebird, I'll try to fix it tomorrow evening and the day after tomorrow, because I am now busy preparing a talk (slides) for tomorrow. Sorry for the inconvenience. Would you mind writing an issue on the Spark JIRA? 2014-06-17 20:55 GMT+08:00 Sguj tpcome...@yahoo.com: > [quoted message trimmed] -- Best Regards --- Xusen Yin (尹绪森) Intel Labs China Homepage: http://yinxusen.github.io/
Re: wholeTextFiles not working with HDFS
I can write one if you'll point me to where I need to write it.
Re: wholeTextFiles not working with HDFS
Hi, I have the same exception. Can you tell me how you fixed it? Thank you!
Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
Hi guys, I ran into the same exception (while trying the same example), and after overriding the hadoop-client artifact in my pom.xml, I got another error (below). System config: Ubuntu 12.04, IntelliJ 13, Scala 2.10.3. Maven dependencies:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0</version>
</dependency>

Any idea why Spark 1.0 is incompatible with Hadoop 2? Thanks for your support in advance!

Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing class
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:186)
    at org.apache.hadoop.mapreduce.SparkHadoopMapReduceUtil$class.firstAvailableClass(SparkHadoopMapReduceUtil.scala:73)
    at org.apache.hadoop.mapreduce.SparkHadoopMapReduceUtil$class.newJobContext(SparkHadoopMapReduceUtil.scala:27)
    at org.apache.spark.rdd.NewHadoopRDD.newJobContext(NewHadoopRDD.scala:61)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:171)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:717)
Re: wholeTextFiles not working with HDFS
My exception stack looks about the same.

java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml does not exist.
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:173)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:717)

I'm using Hadoop 1.2.1, and everything else I've tried in Spark with that version has worked, so I doubt it's a version error.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7570.html
Re: wholeTextFiles not working with HDFS
Hi Sguj,

Could you give me the exception stack? I tested it on my laptop and found that it gets the wrong FileSystem: it should be DistributedFileSystem, but it finds RawLocalFileSystem. If we get the same exception stack, I'll try to fix it. Here is my exception stack:

java.io.FileNotFoundException: File /sen/reuters-out/reut2-000.sgm-0.txt does not exist.
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:173)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1097)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:728)

Besides, what's your Hadoop version?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7548.html
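When the wrong FileSystem is picked up, one workaround (not from this thread, but a common one) is to pass a fully qualified URI so the path is not resolved against a possibly misconfigured fs.defaultFS. A minimal sketch; the namenode host, port, and input path are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesHdfs extends App {
  val sc = new SparkContext(
    new SparkConf().setMaster("local[*]").setAppName("wholeTextFilesHdfs"))

  // An unqualified path is resolved against fs.defaultFS; if that falls back
  // to RawLocalFileSystem, the file "does not exist" locally. A fully
  // qualified hdfs:// URI forces DistributedFileSystem to be used.
  val files = sc.wholeTextFiles("hdfs://namenode:8020/sen/reuters-out")
  files.take(1).foreach { case (path, contents) =>
    println(s"$path: ${contents.length} chars")
  }
  sc.stop()
}
```

This requires a reachable HDFS cluster, so it is a sketch rather than something runnable standalone.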
Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
"Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected" is the classic error meaning you compiled against Hadoop 1 but are running against Hadoop 2. I think you need to override the hadoop-client artifact that Spark depends on to be a Hadoop 2.x version.

On Tue, Jun 3, 2014 at 5:23 PM, toivoa toivo@gmail.com wrote:

Hi

Set up project under Eclipse using Maven:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.0</version>
</dependency>

Simple example fails:

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("CountingSheep")
    .set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)

  val indir = "src/main/resources/testdata"
  val files = sc.wholeTextFiles(indir, 10)
  for (pair <- files) println(pair._1 + " = " + pair._2)
}

14/06/03 19:20:34 ERROR executor.Executor: Exception in task ID 0
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
  at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:164)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.<init>(CombineFileRecordReader.java:126)
  at org.apache.spark.input.WholeTextFileInputFormat.createRecordReader(WholeTextFileInputFormat.scala:44)
  at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:155)
  ... 13 more
Caused by: java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
  at org.apache.spark.input.WholeTextFileRecordReader.<init>(WholeTextFileRecordReader.scala:40)
  ... 18 more

Any idea?

thanks
toivo

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-java-lang-IncompatibleClassChangeError-Found-class-org-apache-hadoop-mapreduce-TaskAtd-tp6818.html
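Overriding the hadoop-client artifact, as suggested above, would look roughly like this in the pom.xml. The 2.4.0 version matches what toivoa later reports as working in this thread; match it to your cluster's actual Hadoop version:

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.0.0</version>
  </dependency>
  <!-- Override the transitive Hadoop 1.x dependency with a Hadoop 2.x client -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.4.0</version>
  </dependency>
</dependencies>
```

Because the explicitly declared hadoop-client is nearer in the dependency tree than Spark's transitive one, Maven's dependency mediation picks the 2.x version.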
wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
Hi

Set up project under Eclipse using Maven:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.0</version>
</dependency>

Simple example fails:

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("CountingSheep")
    .set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)

  val indir = "src/main/resources/testdata"
  val files = sc.wholeTextFiles(indir, 10)
  for (pair <- files) println(pair._1 + " = " + pair._2)
}

14/06/03 19:20:34 ERROR executor.Executor: Exception in task ID 0
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
  at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:164)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.<init>(CombineFileRecordReader.java:126)
  at org.apache.spark.input.WholeTextFileInputFormat.createRecordReader(WholeTextFileInputFormat.scala:44)
  at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:155)
  ... 13 more
Caused by: java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
  at org.apache.spark.input.WholeTextFileRecordReader.<init>(WholeTextFileRecordReader.scala:40)
  ... 18 more

Any idea?

thanks
toivo

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-java-lang-IncompatibleClassChangeError-Found-class-org-apache-hadoop-mapreduce-TaskAtd-tp6818.html
Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
Wow! What a quick reply!

Adding

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0</version>
</dependency>

solved the problem. But now I get

14/06/03 19:52:50 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
  at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
  at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
  at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
  at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)

thanks
toivo

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-java-lang-IncompatibleClassChangeError-Found-class-org-apache-hadoop-mapreduce-TaskAtd-tp6818p6820.html
Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
I'd try the internet / SO first -- these are actually generic Hadoop-related issues. Here I think you don't have HADOOP_HOME or similar set. http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path

On Tue, Jun 3, 2014 at 5:54 PM, toivoa toivo@gmail.com wrote:

Wow! What a quick reply!

Adding

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0</version>
</dependency>

solved the problem. But now I get

14/06/03 19:52:50 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
  at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
  at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
  at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
  at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)

thanks
toivo

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-java-lang-IncompatibleClassChangeError-Found-class-org-apache-hadoop-mapreduce-TaskAtd-tp6818p6820.html
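The fix discussed in the linked Stack Overflow question amounts to telling Hadoop where winutils.exe lives. A sketch, assuming winutils.exe for your Hadoop version has already been placed under a hypothetical C:\hadoop\bin:

```scala
// Set hadoop.home.dir before any Hadoop class is loaded, i.e. before
// creating the SparkContext. The path C:\hadoop is a placeholder; it must
// contain bin\winutils.exe for your Hadoop version.
System.setProperty("hadoop.home.dir", "C:\\hadoop")
```

Setting the HADOOP_HOME environment variable to the same directory achieves the same thing without code changes.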
Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
Yeah, unfortunately Hadoop 2 requires these binaries on Windows. Hadoop 1 runs just fine without them.

Matei

On Jun 3, 2014, at 10:33 AM, Sean Owen so...@cloudera.com wrote:

I'd try the internet / SO first -- these are actually generic Hadoop-related issues. Here I think you don't have HADOOP_HOME or similar set. http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path

On Tue, Jun 3, 2014 at 5:54 PM, toivoa toivo@gmail.com wrote:

Wow! What a quick reply!

Adding

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0</version>
</dependency>

solved the problem. But now I get

14/06/03 19:52:50 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
  at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
  at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
  at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
  at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)

thanks
toivo

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-java-lang-IncompatibleClassChangeError-Found-class-org-apache-hadoop-mapreduce-TaskAtd-tp6818p6820.html