Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or write into it directly? `textFileStream` requires that files must be written to the monitored directory by "moving" them from another location within the same file system.
On Mon, Jan 25, 2016 at 6:30 AM, patcharee <patcharee.thong...@uni.no> wrote: > Hi, > > My streaming application is receiving data from file system and just > prints the input count every 1 sec interval, as the code below: > > val sparkConf = new SparkConf() > val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms)) > val lines = ssc.textFileStream(args(0)) > lines.count().print() > > The problem is sometimes the data received from scc.textFileStream is ONLY > ONE line. But in fact there are multiple lines in the new file found in > that interval. See log below which shows three intervals. In the 2nd > interval, the new file is: > hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt. This file > contains 6288 lines. The ssc.textFileStream returns ONLY ONE line (the > header). > > Any ideas/suggestions what the problem is? > > > ----------------------------------------------------------------------------------------- > SPARK LOG > > ----------------------------------------------------------------------------------------- > > 16/01/25 15:11:11 INFO FileInputDStream: Cleared 1 old files that were > older than 1453731011000 ms: 1453731010000 ms > 16/01/25 15:11:11 INFO FileInputDStream: Cleared 0 old files that were > older than 1453731011000 ms: > 16/01/25 15:11:12 INFO FileInputDStream: Finding new files took 4 ms > 16/01/25 15:11:12 INFO FileInputDStream: New files at time 1453731072000 > ms: > hdfs://helmhdfs/user/patcharee/cerdata/datetime_19616.txt > ------------------------------------------- > Time: 1453731072000 ms > ------------------------------------------- > 6288 > > 16/01/25 15:11:12 INFO FileInputDStream: Cleared 1 old files that were > older than 1453731012000 ms: 1453731011000 ms > 16/01/25 15:11:12 INFO FileInputDStream: Cleared 0 old files that were > older than 1453731012000 ms: > 16/01/25 15:11:13 INFO FileInputDStream: Finding new files took 4 ms > 16/01/25 15:11:13 INFO FileInputDStream: New files at time 1453731073000 > ms: > hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt > ------------------------------------------- > Time: 1453731073000 ms > ------------------------------------------- > 1 > > 16/01/25 15:11:13 INFO FileInputDStream: Cleared 1 old files that were > older than 1453731013000 ms: 1453731012000 ms > 16/01/25 15:11:13 INFO FileInputDStream: Cleared 0 old files that were > older than 1453731013000 ms: > 16/01/25 15:11:14 INFO FileInputDStream: Finding new files took 3 ms > 16/01/25 15:11:14 INFO FileInputDStream: New files at time 1453731074000 > ms: > hdfs://helmhdfs/user/patcharee/cerdata/datetime_19618.txt > ------------------------------------------- > Time: 1453731074000 ms > ------------------------------------------- > 6288 > > > Thanks, > Patcharee >