Hi,
As part of my processing, I have the following code:
rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 10)
rdd.count()
The s3 directory has about 8GB of data and 61,878 files. I am using Spark
2.1, and running it on EMR with 15 m3.xlarge nodes.
The job fails with this error:
Well, 1) the point of wholeTextFiles is that each file becomes a single
record handled by a single executor, and 2) you use .gz, and since gzip is
not splittable you will have at most one executor per file
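The non-splittability of gzip can be sanity-checked outside Spark with Python's gzip/zlib modules (a local sketch, not Spark code):

```python
import gzip
import zlib

# A gzip stream can only be decompressed from its start; there is no way
# to begin reading in the middle, which is why Spark cannot split one
# .gz file across multiple tasks.
data = b"some log line\n" * 10_000
compressed = gzip.compress(data)

# From the beginning, decompression works:
assert gzip.decompress(compressed) == data

# From the middle of the stream, the gzip header and the decoder state
# built up by the earlier bytes are missing, so decompression fails:
try:
    zlib.decompress(compressed[len(compressed) // 2:], wbits=31)  # wbits=31: expect gzip framing
    print("mid-stream decompression worked")
except zlib.error:
    print("mid-stream decompression failed, as expected")
```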
> On 14 Feb 2017, at 09:36, Henry Tremblay <paulhtremb...@gmail.com> wrote:
>
> When I use wholeTextFiles, Spark does not run in parallel, and I have
> 51,000 files at about 1/2 MB per file. I am wondering if I need s3distcp:
> http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
Although if I am understanding you correctly, even if I copy the S3
files to HDFS on EMR, and use wholeTextFiles, I am still only going to
be able to use one executor per file.
> …reading the file in using sc.textFile, and then writing it to HDFS, and
> then using wholeTextFiles for the HDFS result.
> But the bigger issue is that both methods are not executed in parallel. When
> I open my yarn manager, it shows that only one node is being used.
>
> Henry
On 02/06/2017 03:39 PM, Jon Gregg wrote:
Strange that it's working for some directories but not others. Looks like
wholeTextFiles maybe doesn't work with S3?
https://issues.apache.org/jira/browse/SPARK-4414 .
If it's possible to load the data into EMR and run Spark from there, that
may be a workaround. This blogspot shows a python example.
I've actually been able to trace the problem to the files being read in.
If I change to a different directory, then I don't get the error. Is one
of the executors running out of memory?
On 02/06/2017 02:35 PM, Paul Tremblay wrote:
When I try to create an rdd using wholeTextFiles, I get an
incomprehensible error. But when I use the same path with sc.textFile, I
get no error.
I am using pyspark with spark 2.1.
in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/'
rdd = sc.wholeTextFiles(in_path)
Hi all,
I have a requirement to process multiple splittable gzip files, and the
results need to include each individual file name.
I came across a problem when loading multiple gzip files using the
wholeTextFiles method: some files are corrupted, causing an 'unexpected end
of input stream' error.
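For what it's worth, here is a plain-Python sketch of the per-file behaviour described above — keep each file's name with its content, and skip members that die with the truncation error. This is local code, not Spark code, and `read_gzip_files` is a made-up helper name:

```python
import gzip
import os
import tempfile

def read_gzip_files(paths):
    """Return (file name, decompressed text) pairs, skipping files whose
    gzip stream is truncated or corrupt."""
    results = []
    for path in paths:
        try:
            with gzip.open(path, "rt") as f:
                results.append((os.path.basename(path), f.read()))
        except (EOFError, OSError) as e:  # local analogue of "unexpected end of input stream"
            print(f"skipping corrupt file {path}: {e}")
    return results

# Demo: one intact file, one deliberately truncated file.
tmp = tempfile.mkdtemp()
good, bad = os.path.join(tmp, "good.gz"), os.path.join(tmp, "bad.gz")
with open(good, "wb") as f:
    f.write(gzip.compress(b"hello world\n"))
with open(bad, "wb") as f:
    f.write(gzip.compress(b"hello world\n")[:10])  # header only, stream cut off

print(read_gzip_files([good, bad]))  # only the good file survives
```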
Also, in case the issue was not due to the string length (that limit is
still real and may get you later), it may be due to some other indexing
issues which are currently being worked on here:
https://issues.apache.org/jira/browse/SPARK-6235
On Mon, Dec 12, 2016 at 8:18 PM, Jakob Odersky wrote:
Hi Pradeep,
I'm afraid you're running into a hard Java issue. Strings are indexed
with signed integers and can therefore not be longer than
approximately 2 billion characters. Could you use `textFile` as a
workaround? It will give you an RDD of the files' lines instead.
In general, this guide
Hi,
Why is there a restriction on the max file size that can be read by the
wholeTextFiles() method?
I can read a 1.5 gig file but get an out-of-memory error for a 2 gig file.
Also, how can I raise this as a defect in the Spark JIRA? Can someone please guide me.
Thanks,
Pradeep
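The limit Jakob describes can be sanity-checked with quick arithmetic (assuming "gigs" means binary GiB; this is back-of-the-envelope, not a reproduction of the failure):

```python
# Java Strings are backed by char arrays indexed with a signed 32-bit
# int, so a single String tops out at 2**31 - 1 characters.
JAVA_MAX_STRING_LEN = 2**31 - 1
print(JAVA_MAX_STRING_LEN)            # 2147483647

GIB = 1024**3
# 1.5 GiB of single-byte characters fits under the cap...
print(int(1.5 * GIB))                 # 1610612736
# ...while 2 GiB overshoots the cap by exactly one character:
print(2 * GIB - JAVA_MAX_STRING_LEN)  # 1
```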
>>>> …statements in `map` and `foreach` getting printed in cluster mode of
>>>> execution.
>>>>
>>>> I notice a particular line in standalone output that I do NOT see in
>>>> cluster execution.
>>>>
>>>> *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split: Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,…*
I have similar code with textFile() that worked earlier for individual
files on the cluster. The issue is with wholeTextFiles() only.
Please advise on the best way to get this working, or on alternate
approaches.
My setup is the Cloudera 5.7 distribution with the Spark service. I used the
master as `yarn-client`.
> …HDFS, and all its content has to be read at one shot. So I'm using the
> spark context's wholeTextFiles API, passing the HDFS URL for the file.
>
> When I try this from a spark shell it works as mentioned in the
> documentation, but not when I try the same through a program (by submitting it).
I'm using Hadoop 1.0.4 and Spark 1.2.0.
I'm facing a strange issue. I have a requirement to read a small file from
HDFS, and all its content has to be read at one shot. So I'm using the spark
context's wholeTextFiles API, passing the HDFS URL for the file.
When I try this from a spark shell it works as mentioned in the
documentation, but not through a submitted program.
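As a point of reference, each wholeTextFiles record is a (path, entire file content) pair; a plain-Python sketch of the same "one shot" read (`read_whole_file` is a hypothetical helper, not a Spark API):

```python
import os
import tempfile

def read_whole_file(path):
    # One-shot read: the local analogue of a single wholeTextFiles
    # record, i.e. a (path, full file content) pair.
    with open(path, "r") as f:
        return (path, f.read())

# Demo with a small throwaway file:
path = os.path.join(tempfile.mkdtemp(), "small.txt")
with open(path, "w") as f:
    f.write("line1\nline2\n")

print(read_whole_file(path))  # (path, 'line1\nline2\n')
```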
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user
In the SparkUI I can see it creating 2 stages. I tried
wholeTextFiles().repartition(32) but got the same threading results.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591p23593.html
I have 20 nodes via EC2 and an application that reads the data via
wholeTextFiles. I've tried to copy the data into hadoop via
copyFromLocal, and I get
14/11/24 02:00:07 INFO hdfs.DFSClient: Exception in
createBlockOutputStream 172.31.2.209:50010 java.io.IOException: Bad
connect ack
I had the same issue with spark-1.0.2-bin-hadoop*1*, and indeed the issue
seems related to Hadoop1. When switching to using
spark-1.0.2-bin-hadoop*2*, the issue disappears.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working
…: File /MyBucket/MyFolder.tif does not exist.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p10505.html
That worked for me as well. I was using Spark 1.0 compiled against Hadoop
1.0; switching to 1.0.1 compiled against Hadoop 2 fixed it.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p10547.html
You cannot read image files with wholeTextFiles because it uses
CombineFileInputFormat, which cannot read gzipped files because they are not
splittable (source proving it: http://www.bigdataspeak.com/2013_01_01_archive.html):
override def createRecordReader(
    split: InputSplit,
    context: TaskAttemptContext) = …
Interesting question on Stack Overflow:
http://stackoverflow.com/questions/24402737/how-to-read-gz-files-in-spark-using-wholetextfiles
Is it possible to read gzipped files using wholeTextFiles()? Alternately,
is it possible to read the source file names using textFile()?
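On the second question, a plain-Python sketch of the (file name, line) pairing you would want, transparently gunzipping .gz inputs (local code, not a Spark API; `lines_with_filenames` is a made-up name):

```python
import gzip
import os
import tempfile

def lines_with_filenames(paths):
    """Yield (source file name, line) pairs; .gz inputs are decompressed
    on the fly, mirroring what textFile does with gzip files."""
    for path in paths:
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as f:
            for line in f:
                yield (os.path.basename(path), line.rstrip("\n"))

# Demo: one plain file and one gzipped file.
tmp = tempfile.mkdtemp()
plain, gz = os.path.join(tmp, "a.txt"), os.path.join(tmp, "b.gz")
with open(plain, "w") as f:
    f.write("one\ntwo\n")
with open(gz, "wb") as f:
    f.write(gzip.compress(b"three\n"))

print(list(lines_with_filenames([plain, gz])))
# [('a.txt', 'one'), ('a.txt', 'two'), ('b.gz', 'three')]
```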
Is there an equivalent of wholeTextFiles for binary files, for example a set
of images?
Cheers,
Jaonary
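If I remember right, later Spark releases added sc.binaryFiles(path) for exactly this, returning (file name, byte content) pairs. A plain-Python sketch of that shape (local code, not the Spark API; `binary_files` is a made-up helper):

```python
import os
import tempfile

def binary_files(directory):
    """(file name, raw bytes) pairs for every regular file in a
    directory -- the shape an RDD of images would have."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                pairs.append((name, f.read()))
    return pairs

# Demo with two fake "image" files:
tmp = tempfile.mkdtemp()
for name, payload in [("a.png", b"\x89PNG..."), ("b.jpg", b"\xff\xd8...")]:
    with open(os.path.join(tmp, name), "wb") as f:
        f.write(payload)

print(binary_files(tmp))
# [('a.png', b'\x89PNG...'), ('b.jpg', b'\xff\xd8...')]
```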
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7726.html
--
Best Regards
---
Xusen Yin(尹绪森)
Intel Labs China
Homepage: http://yinxusen.github.io/
I can write one if you'll point me to where I need to write it.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7737.html
Hi, I have the same exception. Can you tell me how did you fix it? Thank you!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7665.html
(SparkContext.scala:1094)
at org.apache.spark.rdd.RDD.collect(RDD.scala:717)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-java-lang-IncompatibleClassChangeError-Found-class-org-apache-hadoop-mapreduce-TaskAtd-tp6818p7563.html
…and everything else I've tried in Spark with that
version has worked, so I doubt it's a version error.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7570.html
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7548.html
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-java-lang-IncompatibleClassChangeError-Found-class-org-apache-hadoop-mapreduce-TaskAtd-tp6818.html
at org.apache.spark.input.WholeTextFileRecordReader.&lt;init&gt;(WholeTextFileRecordReader.scala:40)
... 18 more
Any idea?
thanks
toivo
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-java-lang-IncompatibleClassChangeError-Found-class-org
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-java-lang-IncompatibleClassChangeError-Found-class-org-apache-hadoop-mapreduce-TaskAtd-tp6818p6820.html
)
at org.apache.spark.deploy.SparkHadoopUtil$.&lt;init&gt;(SparkHadoopUtil.scala:109)
at org.apache.spark.deploy.SparkHadoopUtil$.&lt;clinit&gt;(SparkHadoopUtil.scala)
thanks
toivo
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-java-lang