I'm trying to get a list of every filename in an HDFS directory from
pySpark, and the only function that seems like it would return the
filenames is wholeTextFiles(). My code for simply collecting that data
is:

       # "port" stands in for the actual namenode port
       files = sc.wholeTextFiles("hdfs://localhost:port/users/me/target")
       files = files.collect()

These lines raise "java.io.FileNotFoundException: File
/user/me/target/capacity-scheduler.xml does not exist", which makes it
seem like the HDFS URI isn't actually being used by the
wholeTextFiles() call.

Those lines work if I point them at a local filesystem directory, and
textFile() works fine on the same HDFS directory I'm trying to use
wholeTextFiles() on.
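
For comparison, something like this runs without error against the same
directory (same placeholder host/port as above):

       files = sc.textFile("hdfs://localhost:port/users/me/target")
       files.count()   # succeeds, so the HDFS path itself is reachable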

I need a way to either fix this, or an alternate method of reading the
filenames from a directory in HDFS.
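
One possible workaround I've been looking at is going through the
Hadoop FileSystem API via the py4j gateway. A rough sketch of what I
mean (untested; it relies on the private sc._jsc and sc._jvm
attributes, so it may break across Spark versions):

       # Reach into the JVM for Hadoop's FileSystem API; _jsc and _jvm
       # are private attributes, so this is version-dependent
       hadoop_conf = sc._jsc.hadoopConfiguration()
       Path = sc._jvm.org.apache.hadoop.fs.Path
       target = Path("hdfs://localhost:port/users/me/target")
       fs = target.getFileSystem(hadoop_conf)
       # listStatus() returns one FileStatus per entry in the directory
       names = [status.getPath().getName() for status in fs.listStatus(target)]

If wholeTextFiles() can be made to respect the hdfs:// URI, though, I'd
rather use that.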


