RE: hdfs streaming context
Try hdfs:///localhost:8020/user/data/* with three slashes. Thx, Tri

-----Original Message-----
From: Benjamin Cuthbert [mailto:cuthbert@gmail.com]
Sent: Monday, December 01, 2014 4:41 PM
To: user@spark.apache.org
Subject: hdfs streaming context

All,

Is it possible to stream on an HDFS directory and listen for multiple files? I have tried the following:

    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val lines = ssc.textFileStream("hdfs://localhost:8020/user/data/*")
    lines.filter(line => line.contains("GE"))
    lines.print()
    ssc.start()

But I get:

    14/12/01 21:35:42 ERROR JobScheduler: Error generating jobs for time 1417469742000 ms
    java.io.FileNotFoundException: File hdfs://localhost:8020/user/data/* does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:408)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1416)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1456)
        at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:107)
        at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:75)

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: hdfs streaming context
Have you tried just passing a path to ssc.textFileStream()? It monitors the path for new files by looking at mtime/atime; all new/touched files in the time window appear as an RDD in the DStream.

On 1 December 2014 at 14:41, Benjamin Cuthbert <cuthbert@gmail.com> wrote:
> Is it possible to stream on an HDFS directory and listen for multiple files?
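Because textFileStream selects files by their modification time, a file that is still being written can be picked up in a half-finished state. The Spark Streaming guide therefore recommends writing files elsewhere and renaming them into the monitored directory. A sketch with the HDFS CLI (the staging path and filenames here are illustrative, not from the thread):

```shell
# Write the file outside the monitored directory first...
hdfs dfs -mkdir -p /user/data_staging
hdfs dfs -put events.txt /user/data_staging/events.txt

# ...then rename it into the watched directory. A rename within the same
# HDFS filesystem is atomic, so the stream only ever sees complete files.
hdfs dfs -mv /user/data_staging/events.txt /user/data/events.txt
```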
Re: hdfs streaming context
Yes, in fact, that's the only way it works. You need hdfs://localhost:8020/user/data, I believe. (No, it's not correct to write hdfs:///...)

On Mon, Dec 1, 2014 at 10:41 PM, Benjamin Cuthbert <cuthbert@gmail.com> wrote:
> Is it possible to stream on an HDFS directory and listen for multiple files?
Re: hdfs streaming context
Thanks Sean, that worked. Just removing the /* and leaving it as /user/data seems to be streaming in.

On 1 Dec 2014, at 22:50, Sean Owen <so...@cloudera.com> wrote:
> Yes, in fact, that's the only way it works. You need
> hdfs://localhost:8020/user/data, I believe. (No, it's not correct to
> write hdfs:///...)
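Putting the thread's fix together: pass the directory itself to textFileStream, not a glob. One further point the original snippet misses is that filter returns a new DStream rather than modifying lines in place, so its result must be kept. A minimal sketch, assuming a NameNode at localhost:8020 and a monitored /user/data directory:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Monitor the directory itself; each 2-second batch picks up files
    // that appeared in it since the previous interval.
    val lines = ssc.textFileStream("hdfs://localhost:8020/user/data")

    // filter returns a new DStream; keep the result instead of discarding it.
    val geLines = lines.filter(line => line.contains("GE"))
    geLines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```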
RE: hdfs streaming context
For the streaming example I am working on, it's accepting hdfs:///user/data without the localhost info. Let me dig through my hdfs config.

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Monday, December 01, 2014 4:50 PM
To: Benjamin Cuthbert
Cc: user@spark.apache.org
Subject: Re: hdfs streaming context

Yes, in fact, that's the only way it works. You need hdfs://localhost:8020/user/data, I believe. (No, it's not correct to write hdfs:///...)
Re: hdfs streaming context
Yes, but you can't follow three slashes with host:port. No host probably defaults to whatever is found in your HDFS config.

On Mon, Dec 1, 2014 at 11:02 PM, Bui, Tri <tri@verizonwireless.com> wrote:
> For the streaming example I am working on, it's accepting hdfs:///user/data
> without the localhost info. Let me dig through my hdfs config.
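Sean's point can be seen with plain java.net.URI parsing, which Hadoop paths follow: after `scheme://` the next component is the authority (host:port), whereas three slashes mean an empty authority, so anything that follows is read as part of the path. A small illustration:

```scala
import java.net.URI

object UriDemo {
  def main(args: Array[String]): Unit = {
    // Two slashes: "localhost:8020" is parsed as the authority.
    val withHost = new URI("hdfs://localhost:8020/user/data")
    println(withHost.getHost) // localhost
    println(withHost.getPort) // 8020
    println(withHost.getPath) // /user/data

    // Three slashes: the authority is empty, so "localhost:8020" is
    // swallowed into the path and no host is resolved at all.
    val noHost = new URI("hdfs:///localhost:8020/user/data")
    println(noHost.getHost) // null
    println(noHost.getPath) // /localhost:8020/user/data
  }
}
```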
RE: hdfs streaming context
Yep, no localhost. Usually I use hdfs:///user/data to indicate I want HDFS, or file:///user/data to indicate a local file directory.

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Monday, December 01, 2014 5:06 PM
To: Bui, Tri
Cc: Benjamin Cuthbert; user@spark.apache.org
Subject: Re: hdfs streaming context

Yes, but you can't follow three slashes with host:port. No host probably defaults to whatever is found in your HDFS config.
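The host-less form works because the Hadoop client fills in the missing authority from its configured default filesystem. In Hadoop 2.x this is the fs.defaultFS property in core-site.xml (older releases call it fs.default.name). A sketch of the relevant fragment, assuming the NameNode address used throughout this thread:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
```

With this in place, hdfs:///user/data and hdfs://localhost:8020/user/data name the same directory, while file:///user/data bypasses HDFS and reads the local filesystem.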