Well, 1) the goal of wholeTextFiles is to read each whole file on a single executor, and 2) you use .gz,
i.e. you will have at most one executor per file.
> On 14 Feb 2017, at 09:36, Henry Tremblay wrote:
>
> When I use wholeTextFiles, Spark does not run in parallel, and YARN runs out
> of memory.
51,000 files at about 1/2 MB per file. I am wondering if I need this:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
Although, if I am understanding you correctly, even if I copy the S3
files to HDFS on EMR and use wholeTextFiles, I am still only going to
be able to use one executor per file.
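One possible way around that for many small .gz files (a sketch under assumptions, not from this thread; the bucket, path, and partition count are placeholders) is to read the raw bytes in parallel and gunzip inside each task:

    import java.util.zip.GZIPInputStream
    import scala.io.Source

    // binaryFiles reads many small files in parallel as (path, stream) pairs;
    // gunzipping then happens inside the tasks rather than on one executor
    val texts = sc.binaryFiles("s3n://my-bucket/pages/*.gz", minPartitions = 64)
      .mapValues { stream =>
        val in = new GZIPInputStream(stream.open())
        try Source.fromInputStream(in, "UTF-8").mkString
        finally in.close()
      }
    println(texts.count())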
Can you post more information about the number of files, their size, and the
executor logs?
A gzipped file is not splittable, i.e. only one executor can gunzip it (the
unzipped data can then be processed in parallel).
wholeTextFiles was designed to read each file on a single executor.
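A minimal sketch of that point (the path and partition count are placeholders, not from this thread): one gzipped file comes back as a single partition, so repartitioning after the read is what lets the downstream work run in parallel:

    val lines = sc.textFile("hdfs:///data/big-file.gz") // one task: gzip is not splittable
    val parallel = lines.repartition(32)                // spread decompressed lines across executors
    println(parallel.map(_.length.toLong).reduce(_ + _))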
I've been working on this problem for several days (partly to increase my
knowledge of Spark). The code you linked to hangs because, after reading in
the file, I have to gunzip it.
Another way that seems to be working is reading each file in using
sc.textFile and then writing it back out uncompressed.
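A rough sketch of that workaround (bucket and paths are made up; Spark decompresses .gz transparently when reading as text):

    val paths = Seq("s3n://my-bucket/pages/part-0001.gz",
                    "s3n://my-bucket/pages/part-0002.gz")
    paths.foreach { p =>
      // read (gunzipped on the fly), then write back out uncompressed
      // so that later jobs can split the data and run in parallel
      sc.textFile(p).saveAsTextFile(p.stripSuffix(".gz") + "-plain")
    }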
Strange that it's working for some directories but not others. Looks like
wholeTextFiles maybe doesn't work with S3?
https://issues.apache.org/jira/browse/SPARK-4414
If it's possible to load the data into EMR and run Spark from there, that
may be a workaround. This blog post shows a Python example.
I've actually been able to trace the problem to the files being read in.
If I change to a different directory, then I don't get the error. Is one
of the executors running out of memory?
On 02/06/2017 02:35 PM, Paul Tremblay wrote:
When I try to create an RDD using wholeTextFiles, I get an error.
Also, in case the issue was not due to the string length (though that limit
is real and may bite you later), it may be due to some other indexing
issues which are currently being worked on here:
https://issues.apache.org/jira/browse/SPARK-6235
On Mon, Dec 12, 2016 at 8:18 PM, Jakob Odersky wrote:
Hi Pradeep,
I'm afraid you're running into a hard Java limitation. Strings are indexed
with signed 32-bit integers and can therefore not be longer than
approximately 2 billion characters. Could you use `textFile` as a
workaround? It will give you an RDD of the files' lines instead.
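A small sketch of that workaround (the path is a placeholder): each line becomes its own record, so no single String has to hold an entire multi-gigabyte file:

    val lines = sc.textFile("hdfs:///data/huge-file.txt")
    // process lines directly instead of materializing one giant String
    val chars = lines.map(_.length.toLong).reduce(_ + _)
    println(s"total characters: $chars")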
In the Spark UI I can see it creating 2 stages. I tried
wholeTextFiles().repartition(32), but got the same single-threaded results.
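One thing that might be worth trying (a suggestion, not confirmed in this thread): wholeTextFiles takes a minPartitions hint, which influences how the files are grouped into splits before the read, whereas repartition only reshuffles after the single-threaded read has already happened:

    // the path and the number 32 are placeholders
    val files = sc.wholeTextFiles("hdfs:///data/*.txt", minPartitions = 32)
    files.mapValues(_.length).take(5).foreach(println)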
I had the same issue with spark-1.0.2-bin-hadoop1, and indeed the issue
seems related to Hadoop 1. When switching to spark-1.0.2-bin-hadoop2,
the issue disappears.
I have the same issue.
    val a = sc.textFile("s3n://MyBucket/MyFolder/*.tif")
    a.first
works perfectly fine, but
    val d = sc.wholeTextFiles("s3n://MyBucket/MyFolder/*.tif")
    d.first
does not work and gives the following error message:
java.io.FileNotFoundException:
That worked for me as well. I was using Spark 1.0 compiled against Hadoop
1.0; switching to 1.0.1 compiled against Hadoop 2 fixed it.
You cannot read image files with wholeTextFiles because it uses
CombineFileInputFormat, which cannot read gzipped files because they are not
splittable. Source proving it:
http://www.bigdataspeak.com/2013_01_01_archive.html
    override def createRecordReader(
        split: InputSplit,
        context: TaskAttemptContext): RecordReader[Text, Text] = ...
I didn't fix the issue so much as work around it. I was running my cluster
locally, so using HDFS was just a preference. The code worked with the local
file system, so that's what I'm using until I can get some help.
Hi Sguj and littlebird,
I'll try to fix it tomorrow evening and the day after tomorrow, because I
am now busy preparing a talk (slides) for tomorrow. Sorry for the
inconvenience. Would you mind writing an issue on the Spark JIRA?
2014-06-17 20:55 GMT+08:00 Sguj tpcome...@yahoo.com:
> I didn't
I can write one if you'll point me to where I need to write it.
Hi, I have the same exception. Can you tell me how you fixed it? Thank you!
Hi guys,
I ran into the same exception (while trying the same example), and after
overriding the hadoop-client artifact in my pom.xml, I got another error
(below).
System config:
Ubuntu 12.04
IntelliJ 13
Scala 2.10.3
Maven:
    <dependency>
        <groupId>org.apache.spark</groupId>
        ...
My exception stack looks about the same:
java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml
does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at ...
Hi Sguj,
Could you give me the exception stack?
I tested it on my laptop and found that it gets the wrong FileSystem: it
should be DistributedFileSystem, but it finds RawLocalFileSystem.
If we get the same exception stack, I'll try to fix it.
Here is my exception stack:
"Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but
interface was expected" is the classic error meaning you compiled
against Hadoop 1 but are running against Hadoop 2.
I think you need to override the hadoop-client artifact that Spark
depends on to be a Hadoop 2.x version.
On Tue,
Wow! What a quick reply!
Adding
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.4.0</version>
    </dependency>
solved the problem.
But now I get:
14/06/03 19:52:50 ERROR Shell: Failed to locate the winutils binary in the
hadoop binary path
I'd try the internet / SO first -- these are actually generic
Hadoop-related issues. Here I think you don't have HADOOP_HOME or
similar set.
http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
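For what it's worth, a commonly cited workaround (my addition; the path is a placeholder) is to point Hadoop at a directory whose bin\ subfolder contains winutils.exe before the SparkContext is created:

    // assumes C:\hadoop\bin\winutils.exe exists on this machine
    System.setProperty("hadoop.home.dir", "C:\\hadoop")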
On Tue, Jun 3, 2014 at 5:54 PM, toivoa wrote:
Yeah unfortunately Hadoop 2 requires these binaries on Windows. Hadoop 1 runs
just fine without them.
Matei
On Jun 3, 2014, at 10:33 AM, Sean Owen so...@cloudera.com wrote:
> I'd try the internet / SO first -- these are actually generic
> Hadoop-related issues. Here I think you don't have