I'm trying to get a list of every filename in a directory in HDFS using PySpark, and the only function that seems like it would return the filenames is wholeTextFiles(). My code for just trying to collect that data is this:
    files = sc.wholeTextFiles("hdfs://localhost:port/users/me/target")
    files = files.collect()

These lines return the error "java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml does not exist", which makes it seem like the HDFS information isn't being used by the wholeTextFiles() function. The same lines work on a local filesystem directory, and the textFile() function works on the HDFS directory I'm trying to use wholeTextFiles() on.

I need a way to either fix this, or an alternate method of reading the filenames from a directory in HDFS.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.