I moved them every interval to the monitored directory.
Patcharee
On 25. jan. 2016 22:30, Shixiong(Ryan) Zhu wrote:
Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/",
or write into it directly? `textFileStream` requires that files must
be written to the monitored directory by "moving" them from another
location within the same file system.
On Mon, Jan 25, 2016 at 6:30 AM, patcharee <patcharee.thong...@uni.no
<mailto:patcharee.thong...@uni.no>> wrote:
Hi,
My streaming application is receiving data from file system and
just prints the input count every 1 sec interval, as the code below:
val sparkConf = new SparkConf()
val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms))
val lines = ssc.textFileStream(args(0))
lines.count().print()
The problem is sometimes the data received from scc.textFileStream
is ONLY ONE line. But in fact there are multiple lines in the new
file found in that interval. See log below which shows three
intervals. In the 2nd interval, the new file is:
hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt. This
file contains 6288 lines. The ssc.textFileStream returns ONLY ONE
line (the header).
Any ideas/suggestions what the problem is?
-----------------------------------------------------------------------------------------
SPARK LOG
-----------------------------------------------------------------------------------------
16/01/25 15:11:11 INFO FileInputDStream: Cleared 1 old files that
were older than 1453731011000 ms: 1453731010000 ms
16/01/25 15:11:11 INFO FileInputDStream: Cleared 0 old files that
were older than 1453731011000 ms:
16/01/25 15:11:12 INFO FileInputDStream: Finding new files took 4 ms
16/01/25 15:11:12 INFO FileInputDStream: New files at time
1453731072000 ms:
hdfs://helmhdfs/user/patcharee/cerdata/datetime_19616.txt
-------------------------------------------
Time: 1453731072000 ms
-------------------------------------------
6288
16/01/25 15:11:12 INFO FileInputDStream: Cleared 1 old files that
were older than 1453731012000 ms: 1453731011000 ms
16/01/25 15:11:12 INFO FileInputDStream: Cleared 0 old files that
were older than 1453731012000 ms:
16/01/25 15:11:13 INFO FileInputDStream: Finding new files took 4 ms
16/01/25 15:11:13 INFO FileInputDStream: New files at time
1453731073000 ms:
hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt
-------------------------------------------
Time: 1453731073000 ms
-------------------------------------------
1
16/01/25 15:11:13 INFO FileInputDStream: Cleared 1 old files that
were older than 1453731013000 ms: 1453731012000 ms
16/01/25 15:11:13 INFO FileInputDStream: Cleared 0 old files that
were older than 1453731013000 ms:
16/01/25 15:11:14 INFO FileInputDStream: Finding new files took 3 ms
16/01/25 15:11:14 INFO FileInputDStream: New files at time
1453731074000 ms:
hdfs://helmhdfs/user/patcharee/cerdata/datetime_19618.txt
-------------------------------------------
Time: 1453731074000 ms
-------------------------------------------
6288
Thanks,
Patcharee