Re: wholeTextFiles not working with HDFS
I forgot to say: I am using bin/spark-shell, spark-1.0.2. That host has Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_11).

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p12678.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: wholeTextFiles not working with HDFS
I had the same issue with spark-1.0.2-bin-hadoop*1*, and indeed the issue seems related to Hadoop 1. When I switched to spark-1.0.2-bin-hadoop*2*, the issue disappeared.
Re: wholeTextFiles not working with HDFS
That worked for me as well. I was using Spark 1.0 compiled against Hadoop 1.0; switching to 1.0.1 compiled against Hadoop 2 fixed it.
Re: wholeTextFiles not working with HDFS
I have the same issue. This works perfectly fine:

val a = sc.textFile("s3n://MyBucket/MyFolder/*.tif")
a.first

but this does not:

val d = sc.wholeTextFiles("s3n://MyBucket/MyFolder/*.tif")
d.first

It gives the following error message:

java.io.FileNotFoundException: File /MyBucket/MyFolder.tif does not exist.
Re: wholeTextFiles not working with HDFS
I can write one if you'll point me to where I need to write it.
Re: wholeTextFiles not working with HDFS
Hi Sguj and littlebird,

I'll try to fix it tomorrow evening and the day after tomorrow, because I am now busy preparing a talk (slides) for tomorrow. Sorry for the inconvenience. Would you mind writing an issue on the Spark JIRA?

2014-06-17 20:55 GMT+08:00 Sguj:
> I didn't fix the issue so much as work around it. I was running my cluster
> locally, so using HDFS was just a preference. The code worked with the
> local file system, so that's what I'm using until I can get some help.

--
Best Regards
---
Xusen Yin (尹绪森)
Intel Labs China
Homepage: http://yinxusen.github.io/
Re: wholeTextFiles not working with HDFS
I didn't fix the issue so much as work around it. I was running my cluster locally, so using HDFS was just a preference. The code worked with the local file system, so that's what I'm using until I can get some help.
Re: wholeTextFiles not working with HDFS
Hi, I have the same exception. Can you tell me how you fixed it? Thank you!
Re: wholeTextFiles not working with HDFS
My exception stack looks about the same:

java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:173)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:717)

I'm using Hadoop 1.2.1, and everything else I've tried in Spark with that version has worked, so I doubt it's a version error.
Re: wholeTextFiles not working with HDFS
Hi Sguj,

Could you give me the exception stack? I tested it on my laptop and found that it gets the wrong FileSystem: it should be DistributedFileSystem, but it finds RawLocalFileSystem instead. If we get the same exception stack, I'll try to fix it. Here is my exception stack:

java.io.FileNotFoundException: File /sen/reuters-out/reut2-000.sgm-0.txt does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:173)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1097)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:728)

Besides, what's your Hadoop version?
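[Editor's note] The mis-resolution diagnosed above (RawLocalFileSystem found where DistributedFileSystem was expected) comes down to how a filesystem implementation is looked up from a path's URI scheme. This is a minimal Python sketch of that lookup idea only; the names `FS_REGISTRY` and `resolve_filesystem` are hypothetical illustrations, not Hadoop's actual API:

```python
from urllib.parse import urlparse

# Hypothetical registry mapping URI schemes to filesystem implementations,
# loosely mimicking how Hadoop resolves fs.<scheme>.impl / fs.default.name.
FS_REGISTRY = {
    "hdfs": "DistributedFileSystem",
    "file": "RawLocalFileSystem",
}
DEFAULT_FS = "RawLocalFileSystem"  # fallback when no scheme is present

def resolve_filesystem(path: str) -> str:
    """Return the filesystem implementation name for a path or URI."""
    scheme = urlparse(path).scheme
    return FS_REGISTRY.get(scheme, DEFAULT_FS)

# A fully qualified HDFS URI resolves correctly...
print(resolve_filesystem("hdfs://localhost:9000/sen/reuters-out"))
# -> DistributedFileSystem

# ...but if the code path strips the URI down to a bare path, the scheme is
# lost and the local filesystem is consulted instead, which then throws
# FileNotFoundException for a file that only exists in HDFS.
print(resolve_filesystem("/sen/reuters-out/reut2-000.sgm-0.txt"))
# -> RawLocalFileSystem
```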
wholeTextFiles not working with HDFS
I'm trying to get a list of every filename in a directory from HDFS using pySpark, and the only thing that seems like it would return the filenames is the wholeTextFiles function. My code for just trying to collect that data is this:

files = sc.wholeTextFiles("hdfs://localhost:port/users/me/target")
files = files.collect()

These lines return the error "java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml does not exist", which makes it seem like the HDFS information isn't getting used with the wholeTextFiles function. Those lines work if I use them on a local filesystem directory, and the textFile() function works on the HDFS directory I'm trying to use wholeTextFiles() on. I need either a way to fix this or an alternate method of reading the filenames from a directory in HDFS.