closure issues: wholeTextFiles

2018-03-27 Thread Gourav Sengupta
Hi,

I can understand why I face closure issues while executing this code:



package spark

//this package is about understanding closures as mentioned in:
//http://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures-


import org.apache.spark.sql.SparkSession


object understandClosures extends App {

  var counter = 0

  // the error is not thrown when we use local[*] as master
  val sparkSession = SparkSession
    .builder
    .master("spark://Gouravs-iMac:7077")
    //.master("local[*]")
    .appName("test")
    .getOrCreate()

  val valueRDD = sparkSession.sparkContext.parallelize(1 until 100)

  println(valueRDD.count())

  // counter is captured by the closure: each executor increments its own
  // deserialized copy, so the driver-side variable is not updated reliably
  valueRDD.foreach(x => counter += x)

  // but even if we use the master as local[*] the total appears as some
  // random number such as -1234435435
  println("the value is " + counter.toString())

  sparkSession.close()

}
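For reference, a minimal sketch of the same sum done with a LongAccumulator
instead of a captured var (this assumes Spark 2.x, where
sparkContext.longAccumulator is available, and reuses the names from the code
above; it is only an illustration):

  // Accumulators are the supported way to aggregate into a driver-visible
  // value from executor-side code; a var captured in a closure is only
  // mutated on the executors' deserialized copies.
  val acc = sparkSession.sparkContext.longAccumulator("sum")
  valueRDD.foreach(x => acc.add(x))
  println("the value is " + acc.value)   // 4950 for 1 until 100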





Can anyone explain why I am facing a closure issue while executing this
code?

package spark

import org.apache.spark.sql.SparkSession
// Import this utility for working with URLs. Unlike Java the semicolon ';' is not required.
import java.net.URL
// Use {...} to provide a list of things to import, when you don't want to import everything
// in a package and you don't want to write a separate line for each type.
import java.io.{File, BufferedInputStream, BufferedOutputStream, FileOutputStream}




object justenoughTest extends App {

  val sparkSession = SparkSession
.builder
.master("spark://Gouravs-iMac:7077")
//.master("local[*]")
.appName("test")
.getOrCreate()

  println(sparkSession.version)

  println("Spark version:  " + sparkSession.version)
  println("Spark master:   " + sparkSession.sparkContext.master)
  println("Running 'locally'?: " + sparkSession.sparkContext.isLocal)

  val pathSeparator = File.separator

  // The target directory, which we'll now create, if necessary.
  val shakespeare = new File("/Users/gouravsengupta/Development/data/shakespeare")

  println(sparkSession.version)

  //val fileContents = sparkSession.read.text("file:///Users/gouravsengupta/Development/data/shakespeare/")
  //val fileContents = sparkSession.read.text(shakespeare.toString)
  val fileContents = sparkSession.sparkContext.wholeTextFiles(shakespeare.toString)
  println(fileContents.count())

  //I am facing  the closure issues below

  val testThis = fileContents.foreach(x => "printing value" + x._1)


sparkSession.close()

}
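One note on the last step (a sketch only, reusing the names above): foreach
runs on the executors and discards the string it builds, so nothing is
expected to show up on the driver. Pulling a few keys back first makes the
file names visible:

  // wholeTextFiles yields (path, content) pairs; take a few paths back to
  // the driver and print them there, instead of inside the executor-side
  // closure.
  fileContents.keys.take(5).foreach(name => println("printing value " + name))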


Regards,
Gourav Sengupta


small job runs out of memory using wholeTextFiles

2017-04-07 Thread Paul Tremblay
As part of my processing, I have the following code:

rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 10)
rdd.count()

The s3 directory has about 8GB of data and 61,878 files. I am using Spark
2.1, and running it on EMR with 15 m3.xlarge nodes.

The job fails with this error:

: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 35532 in stage 0.0 failed 4 times, most recent failure: Lost task
35532.3 in stage 0.0 (TID 35543,
ip-172-31-36-192.us-west-2.compute.internal, executor 6):
ExecutorLostFailure (executor 6 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
7.4 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.


I have run it dozens of times, increasing the number of partitions and
reducing the size of my data set (the original is 60GB), but I get the
same error each time.

In contrast, if I run a simple:

rdd = sc.textFile("s3://paulhtremblay/noaa_tmp/")
rdd.count()

The job finishes in 15 minutes, even with just 3 nodes.
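For what it is worth, a sketch of applying the setting named in the error at
session build time (shown in Scala; the figures are placeholders, not tuned
values, and the same key can also be passed to spark-submit with --conf):

  import org.apache.spark.sql.SparkSession

  // spark.yarn.executor.memoryOverhead is given in megabytes in Spark 2.1.
  // wholeTextFiles materialises each file as a single record, so its per-task
  // memory footprint is far larger than textFile's line-at-a-time reads.
  val spark = SparkSession.builder
    .appName("noaa-wholeTextFiles")
    .config("spark.yarn.executor.memoryOverhead", "2048")
    .getOrCreate()
  val rdd = spark.sparkContext.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 100)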

Thanks

-- 
Paul Henry Tremblay
Robert Half Technology


Re: wholeTextfiles not parallel, runs out of memory

2017-02-14 Thread Jörn Franke
Well, 1) the goal of wholeTextFiles is to read each file whole on a single
executor, and 2) you use .gz, i.e. you will have at most one executor per file.

> On 14 Feb 2017, at 09:36, Henry Tremblay <paulhtremb...@gmail.com> wrote:
> 
> When I use wholeTextFiles, spark does not run in parallel, and yarn runs out 
> of memory. 
> I have documented the steps below. First I copy 6 s3 files to hdfs. Then I 
> create an rdd by:
> 
> 
> sc.wholeTextFiles("/mnt/temp")
> 
> 
> Then I process the files line by line using a simple function. When I look at 
> my nodes, I see only one executor is running. (I assume the other is the name 
> node?) I then get an error message that yarn has run out of memory.
> 
> 
> Steps below:
> 
> 
> 
> [hadoop@ip-172-31-40-213 mnt]$ hadoop fs -ls /mnt/temp
> Found 6 items
> -rw-r--r--   3 hadoop hadoop    3684566 2017-02-14 07:58 
> /mnt/temp/CC-MAIN-20170116095122-00570-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3486510 2017-02-14 08:01 
> /mnt/temp/CC-MAIN-20170116095122-00571-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3498649 2017-02-14 08:05 
> /mnt/temp/CC-MAIN-20170116095122-00572-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    4007644 2017-02-14 08:06 
> /mnt/temp/CC-MAIN-20170116095122-00573-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3990553 2017-02-14 08:07 
> /mnt/temp/CC-MAIN-20170116095122-00574-ip-10-171-10-70.ec2.internal.warc.gz
> -rw-r--r--   3 hadoop hadoop    3689213 2017-02-14 07:54 
> /mnt/temp/CC-MAIN-20170116095122-00575-ip-10-171-10-70.ec2.internal.warc.gz
> 
> 
> In [6]: rdd1 = sc.wholeTextFiles("/mnt/temp")
> In [7]: rdd1.count()
> Out[7]: 6
> 
> def process_file(s):
>     text = s[1]
>     d = {}
>     l = text.split("\n")
>     final = []
>     the_id = "init"
>     for line in l:
>         if line[0:15] == 'WARC-Record-ID:':
>             the_id = line[15:]
>         d[the_id] = line
>         final.append(Row(**d))
>     return final
> 
> 
> In [8]: rdd2 = rdd1.map(process_file)
> In [9]: rdd2.take(1)
> 
> 
> 
> 
> 
> 17/02/14 08:25:25 ERROR YarnScheduler: Lost executor 2 on 
> ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for 
> exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:25:25 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:25:25 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, 
> ip-172-31-35-32.us-west-2.compute.internal, executor 2): ExecutorLostFailure 
> (executor 2 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:29:34 ERROR YarnScheduler: Lost executor 3 on 
> ip-172-31-45-106.us-west-2.compute.internal: Container killed by YARN for 
> exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:29:34 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 4, 
> ip-172-31-45-106.us-west-2.compute.internal, executor 3): ExecutorLostFailure 
> (executor 3 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:29:34 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:33:44 ERROR YarnScheduler: Lost executor 4 on 
> ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for 
> exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:33:44 WARN TaskSetManager: Lost task 0.2 in stage 2.0 (TID 5, 
> ip-172-31-35-32.us-west-2.compute.internal, executor 4): ExecutorLostFailure 
> (executor 4 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 17/02/14 08:33:44 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container 
> killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 
> -- 
> Henry Tremblay
> Robert Half Technology


Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Henry Tremblay

51,000 files at about 1/2 MB per file. I am wondering if I need this

http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html

Although if I am understanding you correctly, even if I copy the S3 
files to HDFS on EMR, and use wholeTextFiles, I am still only going to 
be able to use a single executor?


Henry

On 02/11/2017 01:03 PM, Jörn Franke wrote:
Can you post more information about the number of files, their size 
and the executor logs.


A gzipped file is not splittable i.e. Only one executor can gunzip it 
(the unzipped data can then be processed in parallel).
Wholetextfile was designed to be executed only on one executor (e.g. 
For processing xmls which are difficult to process in parallel).


Then, if you have small files (< HDFS blocksize) they are also only 
processed on one executor by default.


You may repartition though for parallel processing in even those cases.

On 11 Feb 2017, at 21:40, Paul Tremblay <paulhtremb...@gmail.com> wrote:


I've been working on this problem for several days (I am doing more 
to increase my knowledge of Spark). The code you linked to hangs 
because after reading in the file, I have to gunzip it.


Another way that seems to be working is reading each file in using 
sc.textFile, and then writing it the HDFS, and then using 
wholeTextFiles for the HDFS result.


But the bigger issue is that both methods are not executed in 
parallel. When I open my yarn manager, it shows that only one node is 
being used.



Henry


On 02/06/2017 03:39 PM, Jon Gregg wrote:
Strange that it's working for some directories but not others.  
Looks like wholeTextFiles maybe doesn't work with S3? 
https://issues.apache.org/jira/browse/SPARK-4414 .


If it's possible to load the data into EMR and run Spark from there 
that may be a workaround.  This blogspot shows a python workaround 
that might work as well: 
http://michaelryanbell.com/processing-whole-files-spark-s3.html


Jon


On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com> wrote:


I've actually been able to trace the problem to the files being
read in. If I change to a different directory, then I don't get
the error. Is one of the executors running out of memory?





On 02/06/2017 02:35 PM, Paul Tremblay wrote:

When I try to create an rdd using wholeTextFiles, I get an
incomprehensible error. But when I use the same path with
sc.textFile, I get no error.

I am using pyspark with spark 2.1.

in_path =

's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/

rdd = sc.wholeTextFiles(in_path)

rdd.take(1)


/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
   1341
   1342 p = range(partsScanned, min(partsScanned
+ numPartsToTry, totalParts))
-> 1343 res = self.context.runJob(self,
takeUpToNumLeft, p)
   1344
   1345 items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self,
rdd, partitionFunc, partitions, allowLocal)
963 # SparkContext#runJob.
964 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 965 port = self._jvm.PythonRDD.runJob(self._jsc.sc(),
mappedRDD._jrdd, partitions)
966 return list(_load_from_socket(port,
mappedRDD._jrdd_deserializer))
967

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
in __call__(self, *args)
   1131 answer =
self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client,
self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py
in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling
{0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage
fail

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Jörn Franke
Can you post more information about the number of files, their size and the 
executor logs.

A gzipped file is not splittable i.e. Only one executor can gunzip it (the 
unzipped data can then be processed in parallel). 
Wholetextfile was designed to be executed only on one executor (e.g. For 
processing xmls which are difficult to process in parallel).

Then, if you have small files (< HDFS blocksize) they are also only processed 
on one executor by default.

You may repartition though for parallel processing in even those cases.
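A rough sketch of that last suggestion (Scala; sc stands for the
SparkContext and the partition count is an arbitrary placeholder). Note that
repartition only spreads the work that comes after the read; each individual
file is still read by a single task:

  val files = sc.wholeTextFiles("hdfs:///mnt/temp")        // (path, content) pairs
  val spread = files.repartition(48)                       // shuffle across executors
  val lineCounts = spread.mapValues(_.split("\n").length)  // downstream work is now parallel
  println(lineCounts.count())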

> On 11 Feb 2017, at 21:40, Paul Tremblay <paulhtremb...@gmail.com> wrote:
> 
> I've been working on this problem for several days (I am doing more to 
> increase my knowledge of Spark). The code you linked to hangs because after 
> reading in the file, I have to gunzip it. 
> Another way that seems to be working is reading each file in using 
> sc.textFile, and then writing it the HDFS, and then using wholeTextFiles for 
> the HDFS result. 
> But the bigger issue is that both methods are not executed in parallel. When 
> I open my yarn manager, it shows that only one node is being used. 
> 
> Henry
> 
>> On 02/06/2017 03:39 PM, Jon Gregg wrote:
>> Strange that it's working for some directories but not others.  Looks like 
>> wholeTextFiles maybe doesn't work with S3?  
>> https://issues.apache.org/jira/browse/SPARK-4414 .  
>> 
>> If it's possible to load the data into EMR and run Spark from there that may 
>> be a workaround.  This blogspot shows a python workaround that might work as 
>> well: http://michaelryanbell.com/processing-whole-files-spark-s3.html
>> 
>> Jon
>> 
>> 
>> On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com> 
>> wrote:
>>> I've actually been able to trace the problem to the files being read in. If 
>>> I change to a different directory, then I don't get the error. Is one of 
>>> the executors running out of memory?
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On 02/06/2017 02:35 PM, Paul Tremblay wrote:
>>>> When I try to create an rdd using wholeTextFiles, I get an 
>>>> incomprehensible error. But when I use the same path with sc.textFile, I 
>>>> get no error.
>>>> 
>>>> I am using pyspark with spark 2.1.
>>>> 
>>>> in_path = 
>>>> 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/
>>>> 
>>>> rdd = sc.wholeTextFiles(in_path)
>>>> 
>>>> rdd.take(1)
>>>> 
>>>> 
>>>> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
>>>>1341
>>>>1342 p = range(partsScanned, min(partsScanned + 
>>>> numPartsToTry, totalParts))
>>>> -> 1343 res = self.context.runJob(self, takeUpToNumLeft, p)
>>>>1344
>>>>1345 items += res
>>>> 
>>>> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, 
>>>> partitionFunc, partitions, allowLocal)
>>>> 963 # SparkContext#runJob.
>>>> 964 mappedRDD = rdd.mapPartitions(partitionFunc)
>>>> --> 965 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), 
>>>> mappedRDD._jrdd, partitions)
>>>> 966 return list(_load_from_socket(port, 
>>>> mappedRDD._jrdd_deserializer))
>>>> 967
>>>> 
>>>> /usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in 
>>>> __call__(self, *args)
>>>>1131 answer = self.gateway_client.send_command(command)
>>>>1132 return_value = get_return_value(
>>>> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>>>>1134
>>>>1135 for temp_arg in temp_args:
>>>> 
>>>> /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>>>>  61 def deco(*a, **kw):
>>>>  62 try:
>>>> ---> 63 return f(*a, **kw)
>>>>  64 except py4j.protocol.Py4JJavaError as e:
>>>>  65 s = e.java_exception.toString()
>>>> 
>>>> /usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in 
>>>> get_return_value(answer, gateway_client, target_id, name)
>>>> 317 raise Py4JJavaError(
>>>> 318 "An error occurred while calling {0}{1}{2}.\n".
>>>> --> 319 format

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Paul Tremblay
I've been working on this problem for several days (I am doing more to 
increase my knowledge of Spark). The code you linked to hangs because 
after reading in the file, I have to gunzip it.


Another way that seems to be working is reading each file in using 
sc.textFile, then writing it to HDFS, and then using wholeTextFiles 
on the HDFS result.


But the bigger issue is that both methods are not executed in parallel. 
When I open my yarn manager, it shows that only one node is being used.



Henry


On 02/06/2017 03:39 PM, Jon Gregg wrote:
Strange that it's working for some directories but not others.  Looks 
like wholeTextFiles maybe doesn't work with S3? 
https://issues.apache.org/jira/browse/SPARK-4414 .


If it's possible to load the data into EMR and run Spark from there 
that may be a workaround.  This blogspot shows a python workaround 
that might work as well: 
http://michaelryanbell.com/processing-whole-files-spark-s3.html


Jon


On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com> wrote:


I've actually been able to trace the problem to the files being
read in. If I change to a different directory, then I don't get
the error. Is one of the executors running out of memory?





On 02/06/2017 02:35 PM, Paul Tremblay wrote:

When I try to create an rdd using wholeTextFiles, I get an
incomprehensible error. But when I use the same path with
sc.textFile, I get no error.

I am using pyspark with spark 2.1.

in_path =

's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/

rdd = sc.wholeTextFiles(in_path)

rdd.take(1)


/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
   1341
   1342 p = range(partsScanned, min(partsScanned +
numPartsToTry, totalParts))
-> 1343 res = self.context.runJob(self,
takeUpToNumLeft, p)
   1344
   1345 items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self, rdd,
partitionFunc, partitions, allowLocal)
963 # SparkContext#runJob.
964 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 965 port = self._jvm.PythonRDD.runJob(self._jsc.sc(),
mappedRDD._jrdd, partitions)
966 return list(_load_from_socket(port,
mappedRDD._jrdd_deserializer))
967

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client,
self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py
in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling
{0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 0 in stage 1.0 failed 4 times, most recent
failure: Lost task 0.3 in stage 1.0 (TID 7,
ip-172-31-45-114.us-west-2.compute.internal, executor
8): ExecutorLostFailure (executor 8 exited caused by one of
the running tasks) Reason: Container marked as failed:
container_1486415078210_0005_01_16 on host:
ip-172-31-45-114.us-west-2.compute.internal. Exit
status: 52. Diagnostics: Exception from container-launch.
Container id: container_1486415078210_0005_01_16
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
at org.apache.hadoop.util.Shell.run(Shell.java:479)
at

org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
at

org.apache.hadoop.yarn.server.nodemanager.Def

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Jon Gregg
Strange that it's working for some directories but not others.  Looks like
wholeTextFiles maybe doesn't work with S3?
https://issues.apache.org/jira/browse/SPARK-4414 .

If it's possible to load the data into EMR and run Spark from there, that
may be a workaround.  This blog post shows a Python workaround that might
work as well:
http://michaelryanbell.com/processing-whole-files-spark-s3.html

Jon


On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com>
wrote:

> I've actually been able to trace the problem to the files being read in.
> If I change to a different directory, then I don't get the error. Is one of
> the executors running out of memory?
>
>
>
>
>
> On 02/06/2017 02:35 PM, Paul Tremblay wrote:
>
>> When I try to create an rdd using wholeTextFiles, I get an
>> incomprehensible error. But when I use the same path with sc.textFile, I
>> get no error.
>>
>> I am using pyspark with spark 2.1.
>>
>> in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/
>>
>> rdd = sc.wholeTextFiles(in_path)
>>
>> rdd.take(1)
>>
>>
>> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
>>1341
>>1342 p = range(partsScanned, min(partsScanned +
>> numPartsToTry, totalParts))
>> -> 1343 res = self.context.runJob(self, takeUpToNumLeft, p)
>>1344
>>1345 items += res
>>
>> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd,
>> partitionFunc, partitions, allowLocal)
>> 963 # SparkContext#runJob.
>> 964 mappedRDD = rdd.mapPartitions(partitionFunc)
>> --> 965 port = self._jvm.PythonRDD.runJob(self._jsc.sc(),
>> mappedRDD._jrdd, partitions)
>> 966 return list(_load_from_socket(port,
>> mappedRDD._jrdd_deserializer))
>> 967
>>
>> /usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in
>> __call__(self, *args)
>>1131 answer = self.gateway_client.send_command(command)
>>1132 return_value = get_return_value(
>> -> 1133 answer, self.gateway_client, self.target_id,
>> self.name)
>>1134
>>1135 for temp_arg in temp_args:
>>
>> /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>>  61 def deco(*a, **kw):
>>  62 try:
>> ---> 63 return f(*a, **kw)
>>  64 except py4j.protocol.Py4JJavaError as e:
>>  65 s = e.java_exception.toString()
>>
>> /usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in
>> get_return_value(answer, gateway_client, target_id, name)
>> 317 raise Py4JJavaError(
>> 318 "An error occurred while calling
>> {0}{1}{2}.\n".
>> --> 319 format(target_id, ".", name), value)
>> 320 else:
>> 321 raise Py4JError(
>>
>> Py4JJavaError: An error occurred while calling
>> z:org.apache.spark.api.python.PythonRDD.runJob.
>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in
>> stage 1.0 (TID 7, ip-172-31-45-114.us-west-2.compute.internal, executor
>> 8): ExecutorLostFailure (executor 8 exited caused by one of the running
>> tasks) Reason: Container marked as failed: 
>> container_1486415078210_0005_01_16
>> on host: ip-172-31-45-114.us-west-2.compute.internal. Exit status: 52.
>> Diagnostics: Exception from container-launch.
>> Container id: container_1486415078210_0005_01_16
>> Exit code: 52
>> Stack trace: ExitCodeException exitCode=52:
>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
>> at org.apache.hadoop.util.Shell.run(Shell.java:479)
>> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Sh
>> ell.java:773)
>> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerEx
>> ecutor.launchContainer(DefaultContainerExecutor.java:212)
>> at org.apache.hadoop.yarn.server.nodemanager.containermanager.l
>> auncher.ContainerLaunch.call(ContainerLaunch.java:302)
>> at org.apache.hadoop.yarn.server.nodemanager.containermanager.l
>> auncher.ContainerLaunch.call(ContainerLaunch.java:82)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> rdd = sc.textFile(in_path)
>>
>> In [8]: rdd.take(1)
>> Out[8]: [u'WARC/1.0']
>>
>>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
I've actually been able to trace the problem to the files being read in. 
If I change to a different directory, then I don't get the error. Is one 
of the executors running out of memory?





On 02/06/2017 02:35 PM, Paul Tremblay wrote:
When I try to create an rdd using wholeTextFiles, I get an 
incomprehensible error. But when I use the same path with sc.textFile, 
I get no error.


I am using pyspark with spark 2.1.

in_path = 
's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/


rdd = sc.wholeTextFiles(in_path)

rdd.take(1)


/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
   1341
   1342 p = range(partsScanned, min(partsScanned + 
numPartsToTry, totalParts))

-> 1343 res = self.context.runJob(self, takeUpToNumLeft, p)
   1344
   1345 items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, 
partitionFunc, partitions, allowLocal)

963 # SparkContext#runJob.
964 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 965 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), 
mappedRDD._jrdd, partitions)
966 return list(_load_from_socket(port, 
mappedRDD._jrdd_deserializer))

967

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in 
__call__(self, *args)

   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, 
self.name)

   1134
   1135 for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)

317 raise Py4JJavaError(
318 "An error occurred while calling 
{0}{1}{2}.\n".

--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 
in stage 1.0 (TID 7, ip-172-31-45-114.us-west-2.compute.internal, 
executor 8): ExecutorLostFailure (executor 8 exited caused by one of 
the running tasks) Reason: Container marked as failed: 
container_1486415078210_0005_01_16 on host: 
ip-172-31-45-114.us-west-2.compute.internal. Exit status: 52. 
Diagnostics: Exception from container-launch.

Container id: container_1486415078210_0005_01_16
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
at org.apache.hadoop.util.Shell.run(Shell.java:479)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

rdd = sc.textFile(in_path)

In [8]: rdd.take(1)
Out[8]: [u'WARC/1.0']







wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
When I try to create an rdd using wholeTextFiles, I get an 
incomprehensible error. But when I use the same path with sc.textFile, I 
get no error.


I am using pyspark with spark 2.1.

in_path = 
's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/


rdd = sc.wholeTextFiles(in_path)

rdd.take(1)


/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
   1341
   1342 p = range(partsScanned, min(partsScanned + 
numPartsToTry, totalParts))

-> 1343 res = self.context.runJob(self, takeUpToNumLeft, p)
   1344
   1345 items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, 
partitionFunc, partitions, allowLocal)

963 # SparkContext#runJob.
964 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 965 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), 
mappedRDD._jrdd, partitions)
966 return list(_load_from_socket(port, 
mappedRDD._jrdd_deserializer))

967

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in 
__call__(self, *args)

   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)

317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 
in stage 1.0 (TID 7, ip-172-31-45-114.us-west-2.compute.internal, 
executor 8): ExecutorLostFailure (executor 8 exited caused by one of the 
running tasks) Reason: Container marked as failed: 
container_1486415078210_0005_01_16 on host: 
ip-172-31-45-114.us-west-2.compute.internal. Exit status: 52. 
Diagnostics: Exception from container-launch.

Container id: container_1486415078210_0005_01_16
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
at org.apache.hadoop.util.Shell.run(Shell.java:479)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

rdd = sc.textFile(in_path)

In [8]: rdd.take(1)
Out[8]: [u'WARC/1.0']





Handling Input Error in wholeTextFiles

2017-01-11 Thread khwunchai jaengsawang
Hi all,

I have a requirement to process multiple splittable gzip files, and the results 
need to include each individual file name.
I ran into a problem when loading multiple gzip files with the wholeTextFiles 
method: some files are corrupted, causing an 'unexpected end of input stream' 
error, and the whole job fails. I tried using flatMap with a try statement, 
but it still failed. Is there any way to handle this?
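One possible approach (a sketch only, not a drop-in fix): read the files as
binary streams with sc.binaryFiles and gunzip them yourself, so a corrupt
archive becomes a skippable record instead of a job failure. Here sc is the
SparkContext and the path is a placeholder.

  import java.util.zip.GZIPInputStream
  import scala.io.Source
  import scala.util.Try

  val decoded = sc.binaryFiles("hdfs:///data/*.gz").map { case (path, stream) =>
    val content = Try {
      val in = new GZIPInputStream(stream.open())
      try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
    }.toOption                      // None when the stream ends unexpectedly
    (path, content)
  }
  // keep only the files that decompressed cleanly, with their names attached
  val readable = decoded.flatMap { case (path, content) => content.map(path -> _) }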


Regards,

Khwunchai Jaengsawang
Email: khwuncha...@ku.th
Mobile: +66 88 228 1715
LinkedIn <https://linkedin.com/in/khwunchai> | Github 
<https://github.com/khwunchai> | Facebook 
<https://www.facebook.com/khwunchai.j> | Google+ 
<http://google.com/+KhwunchaiJaengsawang>


Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Also, in case the issue was not due to the string length (however it
is still valid and may get you later), the issue may be due to some
other indexing issues which are currently being worked on here
https://issues.apache.org/jira/browse/SPARK-6235

On Mon, Dec 12, 2016 at 8:18 PM, Jakob Odersky  wrote:
> Hi Pradeep,
>
> I'm afraid you're running into a hard Java issue. Strings are indexed
> with signed integers and can therefore not be longer than
> approximately 2 billion characters. Could you use `textFile` as a
> workaround? It will give you an RDD of the files' lines instead.
>
> In general, this guide http://spark.apache.org/contributing.html gives
> information on how to contribute to spark, including instructions on
> how to file bug reports (which does not apply in this case as it isn't
> a bug in Spark).
>
> regards,
> --Jakob
>
> On Mon, Dec 12, 2016 at 7:34 PM, Pradeep  wrote:
>> Hi,
>>
>> Why there is an restriction on max file size that can be read by 
>> wholeTextFile() method.
>>
>> I can read a 1.5 gigs file but get Out of memory for 2 gig file.
>>
>> Also, how can I raise this as an defect in spark jira. Can someone please 
>> guide.
>>
>> Thanks,
>> Pradeep
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>




Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Hi Pradeep,

I'm afraid you're running into a hard Java issue. Strings are indexed
with signed integers and can therefore not be longer than
approximately 2 billion characters. Could you use `textFile` as a
workaround? It will give you an RDD of the files' lines instead.

In general, this guide http://spark.apache.org/contributing.html gives
information on how to contribute to spark, including instructions on
how to file bug reports (which does not apply in this case as it isn't
a bug in Spark).

regards,
--Jakob
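To make that workaround concrete, a small sketch (the path is a
placeholder): read the oversized input line by line, so no single in-memory
String ever has to hold the whole file.

  // textFile streams one line per record, so a 2+ GB file never has to fit
  // into one String (which is capped at roughly 2 billion characters).
  val lines = sc.textFile("hdfs:///data/big_file.txt")
  println(lines.count())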

On Mon, Dec 12, 2016 at 7:34 PM, Pradeep  wrote:
> Hi,
>
> Why there is an restriction on max file size that can be read by 
> wholeTextFile() method.
>
> I can read a 1.5 gigs file but get Out of memory for 2 gig file.
>
> Also, how can I raise this as an defect in spark jira. Can someone please 
> guide.
>
> Thanks,
> Pradeep
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>




wholeTextFiles()

2016-12-12 Thread Pradeep
Hi,

Why is there a restriction on the maximum file size that can be read by the
wholeTextFile() method?

I can read a 1.5 GB file but get an out-of-memory error for a 2 GB file.

Also, how can I raise this as a defect in the Spark JIRA? Can someone please guide me.

Thanks,
Pradeep




Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
Well I have already tried that.
You are talking about a command similar to this, right? yarn logs
-applicationId application_Number
This gives me the processing logs, which contain information about the
tasks, RDD blocks, etc.

What I really need is the output log that gets generated as part of the
Spark job. Which means I generate some output by the Spark job that gets
written to a file mentioned in the job itself. So this file is currently
residing within the appcache, is there a way that I can get this once the
job is over?



On Wed, Sep 21, 2016 at 4:00 PM, ayan guha <guha.a...@gmail.com> wrote:

> On yarn, logs are aggregated from each containers to hdfs. You can use
> yarn CLI or ui to view. For spark, you would have a history server which
> consolidate s the logs
> On 21 Sep 2016 19:03, "Nisha Menon" <nisha.meno...@gmail.com> wrote:
>
>> I looked at the driver logs, that reminded me that I needed to look at
>> the executor logs. There the issue was that the spark executors were not
>> getting a configuration file. I broadcasted the file and now the processing
>> happens. Thanks for the suggestion.
>> Currently my issue is that the log file generated independently by the
>> executors goes to the respective containers' appcache, and then it gets
>> lost. Is there a recommended way to get the output files from the
>> individual executors?
>>
>> On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal <sonalgoy...@gmail.com>
>> wrote:
>>
>>> Are you looking at the worker logs or the driver?
>>>
>>>
>>> On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com>
>>> wrote:
>>>
>>>> I have an RDD created as follows:
>>>>
>>>> *JavaPairRDD<String,String> inputDataFiles =
>>>> sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");*
>>>>
>>>> On this RDD I perform a map to process individual files and invoke a
>>>> foreach to trigger the same map.
>>>>
>>>>* JavaRDD<Object[]> output = inputDataFiles.map(new
>>>> Function<Tuple2<String,String>,Object[]>()*
>>>> *{*
>>>>
>>>> *private static final long serialVersionUID = 1L;*
>>>>
>>>> * @Override*
>>>> * public Object[] call(Tuple2<String,String> v1) throws Exception *
>>>> *{ *
>>>> *  System.out.println("in map!");*
>>>> *   //do something with v1. *
>>>> *  return Object[]*
>>>> *} *
>>>> *});*
>>>>
>>>> *output.foreach(new VoidFunction<Object[]>() {*
>>>>
>>>> * private static final long serialVersionUID = 1L;*
>>>>
>>>> * @Override*
>>>> * public void call(Object[] t) throws Exception {*
>>>> * //do nothing!*
>>>> * System.out.println("in foreach!");*
>>>> * }*
>>>> * }); *
>>>>
>>>> This code works perfectly fine for standalone setup on my local laptop
>>>> while accessing both local files as well as remote HDFS files.
>>>>
>>>> In cluster the same code produces no results. My intuition is that the
>>>> data has not reached the individual executors and hence both the `map` and
>>>> `foreach` does not work. It might be a guess. But I am not able to figure
>>>> out why this would not work in cluster. I dont even see the print
>>>> statements in `map` and `foreach` getting printed in cluster mode of
>>>> execution.
>>>>
>>>> I notice a particular line in standalone output that I do NOT see in
>>>> cluster execution.
>>>>
>>>> *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
>>>> Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*
>>>>
>>>> I had a similar code with textFile() that worked earlier for individual
>>>> files on cluster. The issue is with wholeTextFiles() only.
>>>>
>>>> Please advise what is the best way to get this working or other
>>>> alternate ways.
>>>>
>>>> My setup is cloudera 5.7 distribution with Spark Service. I used the
>>>> master as `yarn-client`.
>>>>
>>>> The action can be anything. Its just a dummy step to invoke the map. I
>>>> also tried *System.out.println("Count is:"+output.count());*, for
>>>> which I got the correct answer of `10`, since there were 10 files in the
>>>> folder, but still the map refuses to work.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>
>>> --
>>> Thanks,
>>> Sonal
>>> Nube Technologies <http://www.nubetech.co>
>>>
>>> <http://in.linkedin.com/in/sonalgoyal>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Nisha Menon
>> BTech (CS) Sahrdaya CET,
>> MTech (CS) IIIT Banglore.
>>
>


-- 
Nisha Menon
BTech (CS) Sahrdaya CET,
MTech (CS) IIIT Banglore.


Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread ayan guha
On YARN, logs are aggregated from each container to HDFS. You can use the yarn
CLI or UI to view them. For Spark, you would have a history server which
consolidates the logs.
On 21 Sep 2016 19:03, "Nisha Menon" <nisha.meno...@gmail.com> wrote:

> I looked at the driver logs, that reminded me that I needed to look at the
> executor logs. There the issue was that the spark executors were not
> getting a configuration file. I broadcasted the file and now the processing
> happens. Thanks for the suggestion.
> Currently my issue is that the log file generated independently by the
> executors goes to the respective containers' appcache, and then it gets
> lost. Is there a recommended way to get the output files from the
> individual executors?
>
> On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal <sonalgoy...@gmail.com>
> wrote:
>
>> Are you looking at the worker logs or the driver?
>>
>>
>> On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com>
>> wrote:
>>
>>> I have an RDD created as follows:
>>>
>>> *JavaPairRDD<String,String> inputDataFiles =
>>> sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");*
>>>
>>> On this RDD I perform a map to process individual files and invoke a
>>> foreach to trigger the same map.
>>>
>>>* JavaRDD<Object[]> output = inputDataFiles.map(new
>>> Function<Tuple2<String,String>,Object[]>()*
>>> *{*
>>>
>>> *private static final long serialVersionUID = 1L;*
>>>
>>> * @Override*
>>> * public Object[] call(Tuple2<String,String> v1) throws Exception *
>>> *{ *
>>> *  System.out.println("in map!");*
>>> *   //do something with v1. *
>>> *  return Object[]*
>>> *} *
>>> *});*
>>>
>>> *output.foreach(new VoidFunction<Object[]>() {*
>>>
>>> * private static final long serialVersionUID = 1L;*
>>>
>>> * @Override*
>>> * public void call(Object[] t) throws Exception {*
>>> * //do nothing!*
>>> * System.out.println("in foreach!");*
>>> * }*
>>> * }); *
>>>
>>> This code works perfectly fine for standalone setup on my local laptop
>>> while accessing both local files as well as remote HDFS files.
>>>
>>> In cluster the same code produces no results. My intuition is that the
>>> data has not reached the individual executors and hence both the `map` and
>>> `foreach` does not work. It might be a guess. But I am not able to figure
>>> out why this would not work in cluster. I dont even see the print
>>> statements in `map` and `foreach` getting printed in cluster mode of
>>> execution.
>>>
>>> I notice a particular line in standalone output that I do NOT see in
>>> cluster execution.
>>>
>>> *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
>>> Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*
>>>
>>> I had a similar code with textFile() that worked earlier for individual
>>> files on cluster. The issue is with wholeTextFiles() only.
>>>
>>> Please advise what is the best way to get this working or other
>>> alternate ways.
>>>
>>> My setup is cloudera 5.7 distribution with Spark Service. I used the
>>> master as `yarn-client`.
>>>
>>> The action can be anything. Its just a dummy step to invoke the map. I
>>> also tried *System.out.println("Count is:"+output.count());*, for which
>>> I got the correct answer of `10`, since there were 10 files in the folder,
>>> but still the map refuses to work.
>>>
>>> Thanks.
>>>
>>>
>>
>> --
>> Thanks,
>> Sonal
>> Nube Technologies <http://www.nubetech.co>
>>
>> <http://in.linkedin.com/in/sonalgoyal>
>>
>>
>>
>>
>
>
> --
> Nisha Menon
> BTech (CS) Sahrdaya CET,
> MTech (CS) IIIT Banglore.
>


How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
I looked at the driver logs, that reminded me that I needed to look at the
executor logs. There the issue was that the spark executors were not
getting a configuration file. I broadcasted the file and now the processing
happens. Thanks for the suggestion.
Currently my issue is that the log file generated independently by the
executors goes to the respective containers' appcache, and then it gets
lost. Is there a recommended way to get the output files from the
individual executors?
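A sketch of the broadcast step described above (Scala; file names and paths
are placeholders): read the small configuration file once on the driver,
broadcast it, and reference the broadcast value inside the executor-side
functions.

  import scala.io.Source

  val configText = Source.fromFile("/local/path/app.conf").mkString  // driver side
  val configBc = sc.broadcast(configText)

  val processed = sc.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/")
    .map { case (path, content) =>
      val conf = configBc.value     // resolved on each executor
      s"$path handled with a ${conf.length}-character config"
    }
  println(processed.count())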

On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> Are you looking at the worker logs or the driver?
>
>
> On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com>
> wrote:
>
>> I have an RDD created as follows:
>>
>> *JavaPairRDD<String,String> inputDataFiles =
>> sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");*
>>
>> On this RDD I perform a map to process individual files and invoke a
>> foreach to trigger the same map.
>>
>>* JavaRDD<Object[]> output = inputDataFiles.map(new
>> Function<Tuple2<String,String>,Object[]>()*
>> *{*
>>
>> *private static final long serialVersionUID = 1L;*
>>
>> * @Override*
>> * public Object[] call(Tuple2<String,String> v1) throws Exception *
>> *{ *
>> *  System.out.println("in map!");*
>> *   //do something with v1. *
>> *  return Object[]*
>> *} *
>> *});*
>>
>> *output.foreach(new VoidFunction<Object[]>() {*
>>
>> * private static final long serialVersionUID = 1L;*
>>
>> * @Override*
>> * public void call(Object[] t) throws Exception {*
>> * //do nothing!*
>> * System.out.println("in foreach!");*
>> * }*
>> * }); *
>>
>> This code works perfectly fine for standalone setup on my local laptop
>> while accessing both local files as well as remote HDFS files.
>>
>> In cluster the same code produces no results. My intuition is that the
>> data has not reached the individual executors and hence both the `map` and
>> `foreach` does not work. It might be a guess. But I am not able to figure
>> out why this would not work in cluster. I dont even see the print
>> statements in `map` and `foreach` getting printed in cluster mode of
>> execution.
>>
>> I notice a particular line in standalone output that I do NOT see in
>> cluster execution.
>>
>> *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
>> Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*
>>
>> I had a similar code with textFile() that worked earlier for individual
>> files on cluster. The issue is with wholeTextFiles() only.
>>
>> Please advise what is the best way to get this working or other alternate
>> ways.
>>
>> My setup is cloudera 5.7 distribution with Spark Service. I used the
>> master as `yarn-client`.
>>
>> The action can be anything. Its just a dummy step to invoke the map. I
>> also tried *System.out.println("Count is:"+output.count());*, for which
>> I got the correct answer of `10`, since there were 10 files in the folder,
>> but still the map refuses to work.
>>
>> Thanks.
>>
>>
>
> --
> Thanks,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
>


-- 
Nisha Menon
BTech (CS) Sahrdaya CET,
MTech (CS) IIIT Banglore.


Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-08 Thread Sonal Goyal
Are you looking at the worker logs or the driver?

On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com> wrote:

> I have an RDD created as follows:
>
> *JavaPairRDD<String,String> inputDataFiles =
> sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");*
>
> On this RDD I perform a map to process individual files and invoke a
> foreach to trigger the same map.
>
>* JavaRDD<Object[]> output = inputDataFiles.map(new
> Function<Tuple2<String,String>,Object[]>()*
> *{*
>
> *private static final long serialVersionUID = 1L;*
>
> * @Override*
> * public Object[] call(Tuple2<String,String> v1) throws Exception *
> *{ *
> *  System.out.println("in map!");*
> *   //do something with v1. *
> *  return Object[]*
> *} *
> *});*
>
> *output.foreach(new VoidFunction<Object[]>() {*
>
> * private static final long serialVersionUID = 1L;*
>
> * @Override*
> * public void call(Object[] t) throws Exception {*
> * //do nothing!*
> * System.out.println("in foreach!");*
> * }*
> * }); *
>
> This code works perfectly fine for standalone setup on my local laptop
> while accessing both local files as well as remote HDFS files.
>
> In cluster the same code produces no results. My intuition is that the
> data has not reached the individual executors and hence both the `map` and
> `foreach` does not work. It might be a guess. But I am not able to figure
> out why this would not work in cluster. I dont even see the print
> statements in `map` and `foreach` getting printed in cluster mode of
> execution.
>
> I notice a particular line in standalone output that I do NOT see in
> cluster execution.
>
> *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
> Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*
>
> I had a similar code with textFile() that worked earlier for individual
> files on cluster. The issue is with wholeTextFiles() only.
>
> Please advise what is the best way to get this working or other alternate
> ways.
>
> My setup is cloudera 5.7 distribution with Spark Service. I used the
> master as `yarn-client`.
>
> The action can be anything. Its just a dummy step to invoke the map. I
> also tried *System.out.println("Count is:"+output.count());*, for which I
> got the correct answer of `10`, since there were 10 files in the folder,
> but still the map refuses to work.
>
> Thanks.
>
>

-- 
Thanks,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>


How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-08 Thread Nisha Menon
I have an RDD created as follows:

JavaPairRDD<String,String> inputDataFiles =
    sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");

On this RDD I perform a map to process individual files and invoke a
foreach to trigger the same map.

    JavaRDD<Object[]> output = inputDataFiles.map(new Function<Tuple2<String,String>, Object[]>() {

        private static final long serialVersionUID = 1L;

        @Override
        public Object[] call(Tuple2<String,String> v1) throws Exception {
            System.out.println("in map!");
            // do something with v1.
            return Object[]
        }
    });

    output.foreach(new VoidFunction<Object[]>() {

        private static final long serialVersionUID = 1L;

        @Override
        public void call(Object[] t) throws Exception {
            // do nothing!
            System.out.println("in foreach!");
        }
    });

This code works perfectly fine for standalone setup on my local laptop
while accessing both local files as well as remote HDFS files.

In cluster the same code produces no results. My intuition is that the data
has not reached the individual executors and hence both the `map` and
`foreach` does not work. It might be a guess. But I am not able to figure
out why this would not work in the cluster. I don't even see the print
statements in `map` and `foreach` getting printed in cluster mode of
execution.

I notice a particular line in standalone output that I do NOT see in
cluster execution.

*16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*

I had a similar code with textFile() that worked earlier for individual
files on cluster. The issue is with wholeTextFiles() only.

Please advise what is the best way to get this working or other alternate
ways.

My setup is cloudera 5.7 distribution with Spark Service. I used the master
as `yarn-client`.

The action can be anything. Its just a dummy step to invoke the map. I also
tried *System.out.println("Count is:"+output.count());*, for which I got
the correct answer of `10`, since there were 10 files in the folder, but
still the map refuses to work.

Thanks.


Re: Issue with wholeTextFiles

2016-03-22 Thread Akhil Das
Can you paste the exception stack here?

Thanks
Best Regards

On Mon, Mar 21, 2016 at 1:42 PM, Sarath Chandra <
sarathchandra.jos...@algofusiontech.com> wrote:

> I'm using Hadoop 1.0.4 and Spark 1.2.0.
>
> I'm facing a strange issue. I have a requirement to read a small file from
> HDFS and all it's content has to be read at one shot. So I'm using spark
> context's wholeTextFiles API passing the HDFS URL for the file.
>
> When I try this from a spark shell it's works as mentioned in the
> documentation, but when I try the same through program (by submitting job
> to cluster) I get FileNotFoundException. I have all compatible JARs in
> place.
>
> Please help.
>
>
>


Issue with wholeTextFiles

2016-03-21 Thread Sarath Chandra
I'm using Hadoop 1.0.4 and Spark 1.2.0.

I'm facing a strange issue. I have a requirement to read a small file from
HDFS, and all its content has to be read in one shot. So I'm using the Spark
context's wholeTextFiles API, passing the HDFS URL of the file.

When I try this from a spark shell it works as mentioned in the
documentation, but when I try the same through a program (by submitting the
job to the cluster) I get a FileNotFoundException. I have all compatible JARs
in place.

Please help.
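One frequent cause of this symptom (not confirmed for this cluster) is a
path that resolves against a different default file system when the job is
submitted than it does in spark-shell. A sketch with a fully qualified URI
(host, port and path are placeholders):

  // Fully qualifying the URI removes any dependence on fs.default.name or
  // the working directory of the submitted driver.
  val smallFile = sc.wholeTextFiles("hdfs://namenode:9000/user/sarath/conf/settings.xml")
  println(smallFile.first()._2)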


wholeTextFiles(/x/*/*.txt) runs single threaded

2015-07-02 Thread Kostas Kougios
Hi, I have a cluster of 4 machines and I run

sc.wholeTextFiles("/x/*/*.txt")

Folder x contains subfolders, and each subfolder contains thousands of files,
with a total of ~1 million matching the path expression.

My spark task starts processing the files but single-threaded. I can see in
the Spark UI that only 1 executor is used out of 4, and only 1 thread out of
the configured 24:

spark-submit --class com.stratified.articleids.NxmlExtractorJob \
--driver-memory 8g \
--executor-memory 8g \
--num-executors 4 \
--executor-cores 16 \
--master yarn-cluster \
--conf spark.akka.frameSize=128 \
$JAR


My actual code is :

  val rdd=extractIds(sc.wholeTextFiles(xmlDir))
  rdd.saveAsObjectFile(serDir)

Is the saveAsObjectFile causing this and any workarounds?
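One knob worth checking (a sketch, not a confirmed fix for this job):
wholeTextFiles takes a minPartitions hint, which influences how many input
splits, and therefore how many parallel read tasks, the million small files
are grouped into.

  // 256 is an arbitrary placeholder; extractIds, xmlDir and serDir are the
  // names from the job above.
  val rdd = extractIds(sc.wholeTextFiles(xmlDir, 256))
  rdd.saveAsObjectFile(serDir)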





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: wholeTextFiles(/x/*/*.txt) runs single threaded

2015-07-02 Thread Kostas Kougios
In SparkUI I can see it creating 2 stages. I tried
wholeTextFiles().repartition(32) but same threading results.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591p23593.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




wholeTextFiles on 20 nodes

2014-11-23 Thread Simon Hafner
I have 20 nodes via EC2 and an application that reads the data via
wholeTextFiles. I've tried to copy the data into hadoop via
copyFromLocal, and I get

14/11/24 02:00:07 INFO hdfs.DFSClient: Exception in
createBlockOutputStream 172.31.2.209:50010 java.io.IOException: Bad
connect ack with firstBadLink as X:50010
14/11/24 02:00:07 INFO hdfs.DFSClient: Abandoning block
blk_-8725559184260876712_2627
14/11/24 02:00:07 INFO hdfs.DFSClient: Excluding datanode X:50010

a lot. Then I went the file system route via copy-dir, which worked
well. Now everything is under /root/txt on all nodes. I submitted the
job with the file:///root/txt/ directory for wholeTextFiles() and I
get

Exception in thread main java.io.FileNotFoundException: File does
not exist: /root/txt/3521.txt

The file exists on the root node and should be everywhere according to
copy-dir. The Hadoop variant worked fine with 3 nodes, but it starts
failing with 20. I added

  <property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>4096</value>
  </property>

to hdfs-site.xml and core-site.xml, but it didn't help.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: wholeTextFiles not working with HDFS

2014-08-22 Thread pierred
I had the same issue with spark-1.0.2-bin-hadoop1, and indeed the issue
seems related to Hadoop 1. When I switch to spark-1.0.2-bin-hadoop2, the
issue disappears.





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: wholeTextFiles not working with HDFS

2014-07-23 Thread kmader
I have the same issue.

val a = sc.textFile("s3n://MyBucket/MyFolder/*.tif")
a.first

works perfectly fine, but

val d = sc.wholeTextFiles("s3n://MyBucket/MyFolder/*.tif")
d.first

does not work and gives the following error message:

java.io.FileNotFoundException: File /MyBucket/MyFolder.tif does not exist.





Re: wholeTextFiles not working with HDFS

2014-07-23 Thread kmader
That worked for me as well. I was using Spark 1.0 compiled against Hadoop
1.0; switching to 1.0.1 compiled against Hadoop 2 fixed it.





Re: wholeTextFiles like for binary files ?

2014-06-26 Thread Akhil Das
You cannot read image files with wholeTextFiles because it uses
CombineFileInputFormat, which cannot read gzipped files because they are not
splittable (source: http://www.bigdataspeak.com/2013_01_01_archive.html):

  override def createRecordReader(
  split: InputSplit,
  context: TaskAttemptContext): RecordReader[String, String] = {

new CombineFileRecordReader[String, String](
  split.asInstanceOf[CombineFileSplit],
  context,
  classOf[WholeTextFileRecordReader])
  }

You may be able to use newAPIHadoopFile with a WholeFileInputFormat
(https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java,
not built into Hadoop but widely available online) to get this to work
correctly. Note that WholeFileInputFormat as written just returns the raw
bytes of each file, so you may have to write your own class, possibly
extending WholeFileInputFormat, to decode the images.
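
For what it's worth, later Spark releases (1.2 and up) added sc.binaryFiles, which is
essentially wholeTextFiles for binary data and avoids a custom InputFormat entirely.
A minimal sketch with an illustrative path:

  // One (path, PortableDataStream) pair per file; toArray() materialises the bytes,
  // which is fine for reasonably small images.
  val images = sc.binaryFiles("hdfs:///data/images")
  val bytesByPath = images.map { case (path, stream) => (path, stream.toArray()) }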

Thanks
Best Regards


On Thu, Jun 26, 2014 at 3:31 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 Is there an equivalent of wholeTextFiles for binary files, for example a
 set of images?

 Cheers,

 Jaonary



wholeTextFiles and gzip

2014-06-25 Thread Nick Chammas
Interesting question on Stack Overflow:
http://stackoverflow.com/questions/24402737/how-to-read-gz-files-in-spark-using-wholetextfiles

Is it possible to read gzipped files using wholeTextFiles()? Alternatively,
is it possible to read the source file names using textFile()?
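
A side note, sketched here rather than taken from the thread: textFile() and the
DataFrame reader decompress .gz files transparently, with each gzipped file becoming
a single non-splittable partition, and in later Spark releases the source file name
can be attached with input_file_name(). A sketch assuming a SparkSession named spark
and an illustrative path:

  import org.apache.spark.sql.functions.input_file_name

  // Each row is one line of text; input_file_name() records which file it came from.
  val lines = spark.read.text("s3n://my-bucket/my-folder/*.gz")
  val withSource = lines.withColumn("source_file", input_file_name())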





wholeTextFiles like for binary files ?

2014-06-25 Thread Jaonary Rabarisoa
Is there an equivalent of wholeTextFiles for binary files, for example a set
of images?

Cheers,

Jaonary


Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I didn't fix the issue so much as work around it. I was running my cluster
locally, so using HDFS was just a preference. The code worked with the local
file system, so that's what I'm using until I can get some help.





Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Xusen Yin
Hi Sguj and littlebird,

I'll try to fix it tomorrow evening and the day after tomorrow, because I
am currently busy preparing slides for a talk tomorrow. Sorry for the
inconvenience. Would you mind filing an issue on Spark JIRA?


2014-06-17 20:55 GMT+08:00 Sguj tpcome...@yahoo.com:

 I didn't fix the issue so much as work around it. I was running my cluster
 locally, so using HDFS was just a preference. The code worked with the
 local
 file system, so that's what I'm using until I can get some help.







-- 
Best Regards
---
Xusen Yin(尹绪森)
Intel Labs China
Homepage: http://yinxusen.github.io/


Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I can write one if you'll point me to where I need to write it.





Re: wholeTextFiles not working with HDFS

2014-06-16 Thread littlebird
Hi, I have the same exception. Can you tell me how you fixed it? Thank you!





Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-13 Thread visenger
Hi guys,

I ran into the same exception (while trying the same example), and after
overriding the hadoop-client artifact in my pom.xml, I got another error
(below).

System config:
Ubuntu 12.04
IntelliJ IDEA 13
Scala 2.10.3

Maven:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.4.0</version>
</dependency>

Any idea why Spark 1.0 is incompatible with Hadoop 2?

Thanks for your support in advance!

Exception in thread main java.lang.IncompatibleClassChangeError:
Implementing class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:186)
at
org.apache.hadoop.mapreduce.SparkHadoopMapReduceUtil$class.firstAvailableClass(SparkHadoopMapReduceUtil.scala:73)
at
org.apache.hadoop.mapreduce.SparkHadoopMapReduceUtil$class.newJobContext(SparkHadoopMapReduceUtil.scala:27)
at 
org.apache.spark.rdd.NewHadoopRDD.newJobContext(NewHadoopRDD.scala:61)
at
org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:171)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
at org.apache.spark.rdd.RDD.collect(RDD.scala:717)





Re: wholeTextFiles not working with HDFS

2014-06-13 Thread Sguj
My exception stack looks about the same.

java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml
does not exist.
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.init(CombineFileInputFormat.java:489)
at
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
at
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
at
org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:173)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
at org.apache.spark.rdd.RDD.collect(RDD.scala:717)

I'm using Hadoop 1.2.1, and everything else I've tried in Spark with that
version has worked, so I doubt it's a version error.





Re: wholeTextFiles not working with HDFS

2014-06-12 Thread yinxusen
Hi Sguj,

Could you give me the exception stack?

I tested it on my laptop and found that it gets the wrong FileSystem: it
should be DistributedFileSystem, but it finds RawLocalFileSystem.

If we get the same exception stack, I'll try to fix it.

Here is my exception stack:

java.io.FileNotFoundException: File /sen/reuters-out/reut2-000.sgm-0.txt
does not exist.
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.init(CombineFileInputFormat.java:489)
at
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
at
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
at
org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:173)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1097)
at org.apache.spark.rdd.RDD.collect(RDD.scala:728)

Besides, what's your Hadoop version?
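
A side note, not from this thread: while debugging this kind of mismatch it can help
to pass a fully qualified HDFS URI instead of a bare path, so the filesystem choice
does not depend on fs.defaultFS (the namenode host and port below are illustrative):

  // A bare path like /sen/reuters-out is resolved against fs.defaultFS; a fully
  // qualified hdfs:// URI selects the DistributedFileSystem by scheme, provided the
  // HDFS classes are on the classpath.
  val files = sc.wholeTextFiles("hdfs://namenode:8020/sen/reuters-out")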






Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Sean Owen
"Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface
was expected" is the classic error meaning you compiled against Hadoop 1 but
are running against Hadoop 2.

I think you need to override the hadoop-client artifact that Spark depends on
to be a Hadoop 2.x version.

On Tue, Jun 3, 2014 at 5:23 PM, toivoa toivo@gmail.com wrote:
 Hi

 Set up project under Eclipse using Maven:

 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-core_2.10</artifactId>
     <version>1.0.0</version>
 </dependency>

 Simple example fails:

   def main(args: Array[String]): Unit = {

 val conf = new SparkConf()
   .setMaster("local")
   .setAppName("CountingSheep")
   .set("spark.executor.memory", "1g")

 val sc = new SparkContext(conf)

 val indir = "src/main/resources/testdata"
 val files = sc.wholeTextFiles(indir, 10)
 for (pair <- files)
   println(pair._1 + " = " + pair._2)


 14/06/03 19:20:34 ERROR executor.Executor: Exception in task ID 0
 java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
 at
 org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:164)
 at
 org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.init(CombineFileRecordReader.java:126)
 at
 org.apache.spark.input.WholeTextFileInputFormat.createRecordReader(WholeTextFileInputFormat.scala:44)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:111)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at
 org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:155)
 ... 13 more
 Caused by: java.lang.IncompatibleClassChangeError: Found class
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
 at
 org.apache.spark.input.WholeTextFileRecordReader.init(WholeTextFileRecordReader.scala:40)
 ... 18 more

 Any idea?

 thanks
 toivo






wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread toivoa
Hi

Set up project under Eclipse using Maven:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.0.0</version>
</dependency>

Simple example fails:

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("CountingSheep")
      .set("spark.executor.memory", "1g")

    val sc = new SparkContext(conf)

    val indir = "src/main/resources/testdata"
    val files = sc.wholeTextFiles(indir, 10)
    for (pair <- files)
      println(pair._1 + " = " + pair._2)
  

14/06/03 19:20:34 ERROR executor.Executor: Exception in task ID 0
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at
org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:164)
at
org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.init(CombineFileRecordReader.java:126)
at
org.apache.spark.input.WholeTextFileInputFormat.createRecordReader(WholeTextFileInputFormat.scala:44)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:111)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:155)
... 13 more
Caused by: java.lang.IncompatibleClassChangeError: Found class
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at
org.apache.spark.input.WholeTextFileRecordReader.init(WholeTextFileRecordReader.scala:40)
... 18 more

Any idea?

thanks
toivo






Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread toivoa
Wow! What a quick reply!

adding 

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.4.0</version>
</dependency>
  
solved the problem.

But now I get 

14/06/03 19:52:50 ERROR Shell: Failed to locate the winutils binary in the
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in
the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.init(Groups.java:77)
at
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at 
org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
at
org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
at 
org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)

thanks
toivo





Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Sean Owen
I'd try the internet / SO first -- these are actually generic
Hadoop-related issues. Here I think you don't have HADOOP_HOME or
similar set.

http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
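
The workaround usually suggested there boils down to making winutils.exe visible to
Hadoop on the Windows machine; a minimal sketch (the C:\hadoop location is only
illustrative):

  // Expects winutils.exe at C:\hadoop\bin\winutils.exe; set this property (or the
  // HADOOP_HOME environment variable) before the first Hadoop/Spark call in the JVM.
  System.setProperty("hadoop.home.dir", "C:\\hadoop")
  val sc = new SparkContext(conf)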

On Tue, Jun 3, 2014 at 5:54 PM, toivoa toivo@gmail.com wrote:
 Wow! What a quick reply!

 adding

 <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-client</artifactId>
     <version>2.4.0</version>
 </dependency>

 solved the problem.

 But now I get

 14/06/03 19:52:50 ERROR Shell: Failed to locate the winutils binary in the
 hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in
 the Hadoop binaries.
 at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
 at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
 at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
 at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
 at 
 org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
 at org.apache.hadoop.security.Groups.init(Groups.java:77)
 at
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
 at
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
 at
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
 at
 org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)

 thanks
 toivo





Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Matei Zaharia
Yeah unfortunately Hadoop 2 requires these binaries on Windows. Hadoop 1 runs 
just fine without them.

Matei

On Jun 3, 2014, at 10:33 AM, Sean Owen so...@cloudera.com wrote:

 I'd try the internet / SO first -- these are actually generic
 Hadoop-related issues. Here I think you don't have HADOOP_HOME or
 similar set.
 
 http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
 
 On Tue, Jun 3, 2014 at 5:54 PM, toivoa toivo@gmail.com wrote:
 Wow! What a quick reply!
 
 adding
 
 <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-client</artifactId>
     <version>2.4.0</version>
 </dependency>
 
 solved the problem.
 
 But now I get
 
 14/06/03 19:52:50 ERROR Shell: Failed to locate the winutils binary in the
 hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in
 the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
at 
 org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.init(Groups.java:77)
at
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at 
 org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
at
 org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
at 
 org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
 
 thanks
 toivo
 
 
 