[ https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381315#comment-14381315 ]
Jayson Sunshine commented on SPARK-4414:
----------------------------------------

Is this issue related to the input source being compressed? Can wholeTextFiles handle compressed files similarly to textFile?

Pedro, I infer from your file name of 'myfile.txt' that it was not compressed. Is this true?

Phatak, in your gist you are grabbing a part file from s3 whereas Pedro was trying to read off a 'whole' file name. Do you guys think this matters?

I, too, cannot use wholeTextFiles to read files on s3 that I can read with textFile.

> SparkContext.wholeTextFiles Doesn't work with S3 Buckets
> --------------------------------------------------------
>
>                 Key: SPARK-4414
>                 URL: https://issues.apache.org/jira/browse/SPARK-4414
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Pedro Rodriguez
>            Priority: Critical
>
> SparkContext.wholeTextFiles does not read files which SparkContext.textFile
> can read. Below are general steps to reproduce; my specific case follows
> them, in a git repo.
> Steps to reproduce.
> 1. Create Amazon S3 bucket, make public with multiple files
> 2. Attempt to read bucket with
> sc.wholeTextFiles("s3n://mybucket/myfile.txt")
> 3. Spark returns the following error, even if the file exists.
> Exception in thread "main" java.io.FileNotFoundException: File does not
> exist: /myfile.txt
>       at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
>       at
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
> 4. Change the call to
> sc.textFile("s3n://mybucket/myfile.txt")
> and there is no error message; the application should run fine.
> There is a question on StackOverflow as well on this:
> http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
> This is a link to the repo/lines of code. The uncommented call doesn't work;
> the commented call works as expected:
> https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
> It would be easy to use textFile with a multifile argument, but this should
> work correctly for s3 bucket files as well.