Have you tried just passing a directory path to ssc.textFileStream()? It monitors the path for new files by looking at mtime/atime; all new or touched files in each batch window appear as an RDD in the DStream. It expects a directory, not a glob pattern, which is why the wildcard path below fails with FileNotFoundException.
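A minimal sketch of that fix, assuming the same HDFS layout as in your snippet: pass the directory itself (no trailing `/*`), and note that `filter` returns a new DStream, so its result has to be captured — in your original code the filtered stream is discarded and the unfiltered `lines` is printed.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Pass the directory, not a glob: files that newly appear in it
    // are picked up in the batch for that interval.
    val lines = ssc.textFileStream("hdfs://localhost:8020/user/data")

    // filter() returns a new DStream; keep a reference to it,
    // otherwise you print the unfiltered stream.
    val geLines = lines.filter(line => line.contains("GE"))
    geLines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```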
On 1 December 2014 at 14:41, Benjamin Cuthbert <cuthbert....@gmail.com> wrote:
> All,
>
> Is it possible to stream on HDFS directory and listen for multiple files?
>
> I have tried the following
>
> val sparkConf = new SparkConf().setAppName("HdfsWordCount")
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> val lines = ssc.textFileStream("hdfs://localhost:8020/user/data/*")
> lines.filter(line => line.contains("GE"))
> lines.print()
> ssc.start()
>
> But I get
>
> 14/12/01 21:35:42 ERROR JobScheduler: Error generating jobs for time 1417469742000 ms
> java.io.FileNotFoundException: File hdfs://localhost:8020/user/data/* does not exist.
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:408)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1416)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1456)
>         at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:107)
>         at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:75)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------