RE: hdfs streaming context

2014-12-01 Thread Bui, Tri
Try 

(hdfs:///localhost:8020/user/data/*) 

with three slashes.

Thx
tri

-----Original Message-----
From: Benjamin Cuthbert [mailto:cuthbert@gmail.com] 
Sent: Monday, December 01, 2014 4:41 PM
To: user@spark.apache.org
Subject: hdfs streaming context

All,

Is it possible to stream on an HDFS directory and listen for multiple files?

I have tried the following

val sparkConf = new SparkConf().setAppName("HdfsWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val lines = ssc.textFileStream("hdfs://localhost:8020/user/data/*")
lines.filter(line => line.contains("GE"))
lines.print()
ssc.start()

But I get

14/12/01 21:35:42 ERROR JobScheduler: Error generating jobs for time 1417469742000 ms
java.io.FileNotFoundException: File hdfs://localhost:8020/user/data/* does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:408)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1416)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1456)
    at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:107)
    at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:75)





Re: hdfs streaming context

2014-12-01 Thread Andy Twigg
Have you tried just passing a path to ssc.textFileStream()? It
monitors the path for new files by looking at mtime/atime; all
new/touched files in the time window appear as an RDD in the DStream.
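
For example, a minimal sketch of that (assuming the NameNode is at localhost:8020 and new files land in /user/data; the "GE" filter is just carried over from the example in the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("HdfsWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// Watch the directory itself (no glob); every 2-second batch becomes an RDD
// containing the lines of whatever files appeared in the directory during that window.
val lines = ssc.textFileStream("hdfs://localhost:8020/user/data")
val matched = lines.filter(line => line.contains("GE"))
matched.print()

ssc.start()
ssc.awaitTermination()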






Re: hdfs streaming context

2014-12-01 Thread Sean Owen
Yes, in fact, that's the only way it works. You need
hdfs://localhost:8020/user/data, I believe.

(No, it's not correct to write hdfs:///...)
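
In other words, only the textFileStream argument changes (a sketch, reusing the ssc and NameNode address from the question):

// before: ssc.textFileStream("hdfs://localhost:8020/user/data/*")
val lines = ssc.textFileStream("hdfs://localhost:8020/user/data")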






Re: hdfs streaming context

2014-12-01 Thread Benjamin Cuthbert
Thanks Sean,

That worked, just removing the /* and leaving it as /user/data.

Seems to be streaming in.







RE: hdfs streaming context

2014-12-01 Thread Bui, Tri
For the streaming example I am working on, it's accepted (hdfs:///user/data) without the localhost info.

Let me dig through my hdfs config.










Re: hdfs streaming context

2014-12-01 Thread Sean Owen
Yes, but you can't follow three slashes with host:port. With no host, it
probably defaults to whatever is found in your HDFS config.
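
A quick way to see what a host-less URI falls back to (a sketch; it assumes the Hadoop client config, core-site.xml, is on the classpath):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val hadoopConf = new Configuration()
// Default filesystem used when no host:port is given; the key is
// "fs.defaultFS" on Hadoop 2.x (older configs use "fs.default.name").
println(hadoopConf.get("fs.defaultFS"))
println(FileSystem.get(hadoopConf).getUri)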





RE: hdfs streaming context

2014-12-01 Thread Bui, Tri
Yep, no localhost.

Usually, I use hdfs:///user/data to indicate I want HDFS, or file:///user/data
to indicate the local file directory.
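
For example (a sketch; ssc is the StreamingContext from the question, and the paths are placeholders):

// Scheme-only URIs: the authority (host:port) comes from fs.defaultFS in core-site.xml.
val fromHdfs  = ssc.textFileStream("hdfs:///user/data")   // directory on HDFS
val fromLocal = ssc.textFileStream("file:///user/data")   // directory on the local filesystem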


