Have you tried just passing a directory path to ssc.textFileStream()? It monitors the path for new files by looking at mtime/atime; all new or touched files in each batch window appear as an RDD in the DStream. It expects a directory, not a glob pattern, which is why the wildcard path below fails with FileNotFoundException.
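A minimal sketch of that fix, assuming the same HDFS layout as in your snippet: pass the directory itself (no trailing `/*`), and note that `filter` returns a new DStream, so its result has to be captured — in your original code the filtered stream is discarded and the unfiltered `lines` is printed.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Pass the directory, not a glob: files that newly appear in it
    // are picked up in the batch for that interval.
    val lines = ssc.textFileStream("hdfs://localhost:8020/user/data")

    // filter() returns a new DStream; keep a reference to it,
    // otherwise you print the unfiltered stream.
    val geLines = lines.filter(line => line.contains("GE"))
    geLines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```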
On 1 December 2014 at 14:41, Benjamin Cuthbert <cuthbert....@gmail.com> wrote:
> All,
>
> Is it possible to stream on HDFS directory and listen for multiple files?
>
> I have tried the following
>
> val sparkConf = new SparkConf().setAppName("HdfsWordCount")
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> val lines = ssc.textFileStream("hdfs://localhost:8020/user/data/*")
> lines.filter(line => line.contains("GE"))
> lines.print()
> ssc.start()
>
> But I get
>
> 14/12/01 21:35:42 ERROR JobScheduler: Error generating jobs for time 1417469742000 ms
> java.io.FileNotFoundException: File hdfs://localhost:8020/user/data/* does not exist.
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:408)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1416)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1456)
>         at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:107)
>         at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:75)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------