[jira] [Commented] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets

Jayson Sunshine (JIRA) Wed, 25 Mar 2015 21:02:42 -0700

    [ https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381315#comment-14381315 ]


Jayson Sunshine commented on SPARK-4414:
----------------------------------------

Is this issue related to the input source being compressed? Can wholeTextFiles 
handle compressed files similarly to textFile?

Pedro, I infer from your file name of 'myfile.txt' that it was not compressed. 
Is this true?

Phatak, in your gist you are grabbing a part file from s3 whereas Pedro was 
trying to read off a 'whole' file name. Do you guys think this matters?

I, too, cannot use wholeTextFiles to read files on S3 that I can read with 
textFile.
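
One detail in the quoted report below may be worth checking. The exception names the path '/myfile.txt' and is raised from DistributedFileSystem, the HDFS client. The sketch below shows only that '/myfile.txt' is what remains of the s3n URI once its scheme and bucket are dropped. Whether the path is in fact being resolved against the default filesystem instead of S3 is an assumption on my part, not something confirmed in this thread. PathSketch and pathOnly are hypothetical names used for illustration.

```scala
import java.net.URI

object PathSketch {
  // Returns only the path component of a URI string, dropping the
  // scheme ("s3n") and the authority (the bucket name).
  def pathOnly(uri: String): String = new URI(uri).getPath

  def main(args: Array[String]): Unit = {
    val uri = new URI("s3n://mybucket/myfile.txt")
    println(s"scheme    = ${uri.getScheme}")    // s3n
    println(s"authority = ${uri.getAuthority}") // mybucket
    println(s"path      = ${uri.getPath}")      // /myfile.txt
  }
}
```

If that reading is right, the file being compressed or being a part file would not be the deciding factor, but this needs to be verified against the Hadoop version in use.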

> SparkContext.wholeTextFiles Doesn't work with S3 Buckets
> --------------------------------------------------------
>
>                 Key: SPARK-4414
>                 URL: https://issues.apache.org/jira/browse/SPARK-4414
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Pedro Rodriguez
>            Priority: Critical
>
> SparkContext.wholeTextFiles does not read files which SparkContext.textFile 
> can read. Below are general steps to reproduce; my specific case, in a git 
> repo, follows.
> Steps to reproduce:
> 1. Create an Amazon S3 bucket containing multiple files and make it public.
> 2. Attempt to read a file from the bucket with
> sc.wholeTextFiles("s3n://mybucket/myfile.txt")
> 3. Spark returns the following error, even though the file exists:
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: /myfile.txt
>       at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
>       at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
> 4. Change the call to
> sc.textFile("s3n://mybucket/myfile.txt")
> and there is no error message; the application should run fine.
> There is also a question about this on StackOverflow:
> http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
> This is a link to the relevant lines of code in the repo. The uncommented 
> call doesn't work; the commented call works as expected:
> https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
> It would be easy to use textFile with a multifile argument, but 
> wholeTextFiles should work correctly for S3 bucket files as well.
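
The workaround mentioned at the end of the report, textFile with a multifile argument, could be sketched roughly as follows. textFile accepts a comma-separated list of paths in a single string. The object name, the joinPaths helper, and the file names a.txt and b.txt are hypothetical and used only for illustration. The Spark call itself is left as a comment so the sketch runs without a cluster.

```scala
object MultiFileArg {
  // Hypothetical helper: joins individual file URIs into the single
  // comma-separated string that SparkContext.textFile accepts.
  def joinPaths(paths: Seq[String]): String = paths.mkString(",")

  def main(args: Array[String]): Unit = {
    // Assumed file names; replace with the real keys in the bucket.
    val paths = Seq("s3n://mybucket/a.txt", "s3n://mybucket/b.txt")
    println(joinPaths(paths))

    // With a live SparkContext `sc`, the call would then be:
    //   val lines = sc.textFile(joinPaths(paths))
    // This yields an RDD of lines across all files, not the
    // (fileName, content) pairs that wholeTextFiles returns.
  }
}
```

Because the result is a flat RDD of lines, this is not a drop-in replacement where the per-file grouping matters.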



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
