[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

2017-04-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15952201#comment-15952201
 ] 

Steve Loughran commented on SPARK-6527:
---

Hadoop 2.8.0 is out the door; try against those JARs before filing bug reports 
against the HADOOP- module. If you do find a problem, include as much detail as 
you can, ideally with {{org.apache.hadoop.fs.s3a.S3AFileSystem}} logged at 
debug, and mark the issue as a dependency of HADOOP-13204. Thanks
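For reference, debug logging for the s3a client can usually be turned on 
through the cluster's log4j configuration; a minimal sketch (the exact file 
location varies by deployment):

```
# log4j.properties fragment (illustrative): log the s3a filesystem at DEBUG
log4j.logger.org.apache.hadoop.fs.s3a.S3AFileSystem=DEBUG
# or, more broadly, the whole s3a package:
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
```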

> sc.binaryFiles can not access files on s3
> -
>
> Key: SPARK-6527
> URL: https://issues.apache.org/jira/browse/SPARK-6527
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Input/Output
>Affects Versions: 1.2.0, 1.3.0
> Environment: I am running Spark on EC2
>Reporter: Zhao Zhang
>Priority: Minor
>
> sc.binaryFiles() cannot access files stored on S3. It correctly lists the 
> number of files, but reports "file does not exist" when processing them. I 
> also tried sc.textFile(), which works fine.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

2016-04-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259966#comment-15259966
 ] 

Steve Loughran commented on SPARK-6527:
---

Actually, looking at {{SparkContext.binaryFiles()}}, this could just be 
SPARK-7155 surfacing.

Does this still happen on Spark >= 1.3.2? Ideally, could you check on 1.6.1+?



[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

2016-04-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259963#comment-15259963
 ] 

Steve Loughran commented on SPARK-6527:
---

I've not seen a JIRA surface:

# if anyone does file one, link it to HADOOP-11694 (S3a Phase II), which I'm 
trying to wrap up this week.
# what are the characters in question?
# if it's not just complex characters in a name, how many files in a directory 
tree does it take to trigger this problem?


Looking into the Hadoop code, this specific error string appears when there is 
no match on a path containing a pattern:
{code}
Path p = dirs[i];
FileSystem fs = p.getFileSystem(job.getConfiguration());
FileStatus[] matches = fs.globStatus(p, inputFilter);
if (matches == null) {
  errors.add(new IOException("Input path does not exist: " + p));
} else if (matches.length == 0) {
  errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
...
{code}

It might be that odd characters in filenames are confusing that pattern 
matching.
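To illustrate how glob metacharacters in a literal file name can defeat that 
matching, here is a small self-contained sketch using java.nio's glob matcher. 
Its syntax is similar to, though not identical to, Hadoop's path globbing, and 
the class and method names here are mine, purely for illustration:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    // Returns true if the glob pattern matches the given file name.
    static boolean globMatches(String pattern, String name) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return m.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        // A file literally named "img[1].jpg": fed back into a glob matcher,
        // "[1]" becomes a character class, so the name no longer matches itself.
        System.out.println(globMatches("img[1].jpg", "img[1].jpg")); // false
        System.out.println(globMatches("img[1].jpg", "img1.jpg"));   // true
    }
}
```

If listing returns literal names that are later re-interpreted as patterns, a 
file whose name contains {{[}}, {{]}}, {{\{}} or {{*}} could plausibly match 
zero files, producing exactly the error string above.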




[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

2016-04-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249141#comment-15249141
 ] 

Nicholas Chammas commented on SPARK-6527:
-

Did the s3a suggestion work? If not, did anybody file an issue with more 
detail, as Steve suggested?



[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

2015-10-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965006#comment-14965006
 ] 

Steve Loughran commented on SPARK-6527:
---

try using s3a instead of s3n (ideally on Hadoop 2.7+); it may have better 
character support. Otherwise, file a JIRA on Hadoop Common with component = 
{{fs/s3}}, listing an example path which isn't valid.
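For anyone trying the s3a switch: assuming the hadoop-aws JAR and its AWS SDK 
dependency are on the classpath, the move is essentially a URL-scheme change 
plus credentials under the {{fs.s3a.*}} keys. A sketch (property names per the 
Hadoop s3a documentation; values are obviously placeholders):

```
# spark-defaults.conf fragment (illustrative)
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY
```

then read with an s3a URL, e.g. {{sc.binaryFiles("s3a://mybucket/path/*")}}.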



[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

2015-10-19 Thread bin wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964267#comment-14964267
 ] 

bin wang commented on SPARK-6527:
-

[~zhaozhang], this error happens to me too while I am using Databricks' 
notebook. I have tons of images in a bucket, say `mybucket`; when I do 
`binaryFiles('mybucket/*')`, it errors out with the same message as yours. 
However, some of the images contain special characters, and when I do 
`binaryFiles('mybucket/00*.jpg')` to restrict to a very small number of images, 
the command runs successfully.

In that case, I think there is probably something picky about file names 
containing certain characters.



[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

2015-05-18 Thread Ewen Cheslack-Postava (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549562#comment-14549562
 ] 

Ewen Cheslack-Postava commented on SPARK-6527:
--

Here's a stack trace, which looks like it's incorrectly trying to use the local 
filesystem to open the file:

{quote}
java.io.FileNotFoundException: File /path/to/file does not exist.
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
  at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:44)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
  at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
  at $iwC$$iwC$$iwC.<init>(<console>:37)
  at $iwC$$iwC.<init>(<console>:39)
  at $iwC.<init>(<console>:41)
  at <init>(<console>:43)
  at .<init>(<console>:47)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
  at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
  at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
  at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
  at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
  at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
  at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
  at org.apache.spark.repl.Main$.main(Main.scala:31)
  at org.apache.spark.repl.Main.main(Main.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
  at