Any possibility that this file is still written by other application, so what Spark Streaming processed is an incomplete file.
On Tue, Jan 26, 2016 at 5:30 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com > wrote: > Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or > write into it directly? `textFileStream` requires that files must be > written to the monitored directory by "moving" them from another location > within the same file system. > > On Mon, Jan 25, 2016 at 6:30 AM, patcharee <patcharee.thong...@uni.no> > wrote: > >> Hi, >> >> My streaming application is receiving data from file system and just >> prints the input count every 1 sec interval, as the code below: >> >> val sparkConf = new SparkConf() >> val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms)) >> val lines = ssc.textFileStream(args(0)) >> lines.count().print() >> >> The problem is sometimes the data received from scc.textFileStream is >> ONLY ONE line. But in fact there are multiple lines in the new file found >> in that interval. See log below which shows three intervals. In the 2nd >> interval, the new file is: >> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt. This file >> contains 6288 lines. The ssc.textFileStream returns ONLY ONE line (the >> header). >> >> Any ideas/suggestions what the problem is? >> >> >> ----------------------------------------------------------------------------------------- >> SPARK LOG >> >> ----------------------------------------------------------------------------------------- >> >> 16/01/25 15:11:11 INFO FileInputDStream: Cleared 1 old files that were >> older than 1453731011000 ms: 1453731010000 ms >> 16/01/25 15:11:11 INFO FileInputDStream: Cleared 0 old files that were >> older than 1453731011000 ms: >> 16/01/25 15:11:12 INFO FileInputDStream: Finding new files took 4 ms >> 16/01/25 15:11:12 INFO FileInputDStream: New files at time 1453731072000 >> ms: >> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19616.txt >> ------------------------------------------- >> Time: 1453731072000 ms >> ------------------------------------------- >> 6288 >> >> 16/01/25 15:11:12 INFO FileInputDStream: Cleared 1 old files that were >> older than 1453731012000 ms: 1453731011000 ms >> 16/01/25 15:11:12 INFO FileInputDStream: Cleared 0 old files that were >> older than 1453731012000 ms: >> 16/01/25 15:11:13 INFO FileInputDStream: Finding new files took 4 ms >> 16/01/25 15:11:13 INFO FileInputDStream: New files at time 1453731073000 >> ms: >> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt >> ------------------------------------------- >> Time: 1453731073000 ms >> ------------------------------------------- >> 1 >> >> 16/01/25 15:11:13 INFO FileInputDStream: Cleared 1 old files that were >> older than 1453731013000 ms: 1453731012000 ms >> 16/01/25 15:11:13 INFO FileInputDStream: Cleared 0 old files that were >> older than 1453731013000 ms: >> 16/01/25 15:11:14 INFO FileInputDStream: Finding new files took 3 ms >> 16/01/25 15:11:14 INFO FileInputDStream: New files at time 1453731074000 >> ms: >> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19618.txt >> ------------------------------------------- >> Time: 1453731074000 ms >> ------------------------------------------- >> 6288 >> >> >> Thanks, >> Patcharee >> > >