Is it possible that this file was still being written by another application,
so that what Spark Streaming processed was an incomplete file?
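
One way to rule this out is to write each file to a staging directory on the
same file system and rename it into the monitored directory only once it is
complete. Below is a minimal sketch, assuming Hadoop's `FileSystem` API is on
the classpath (it normally is for a Spark application). The object name
`AtomicPublish`, the `publish` helper, and the idea of a separate staging
directory are hypothetical illustrations, not part of the original
application:

```scala
import java.io.IOException
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: writes a file completely in a staging directory,
// then renames it into the directory watched by textFileStream.
object AtomicPublish {
  def publish(fs: FileSystem,
              stagingDir: Path,   // assumed to be on the same file system as watchedDir
              watchedDir: Path,   // the directory passed to ssc.textFileStream
              name: String,
              lines: Seq[String]): Path = {
    fs.mkdirs(stagingDir)
    fs.mkdirs(watchedDir)

    val tmp = new Path(stagingDir, name)
    val dst = new Path(watchedDir, name)

    // Write the whole file where the streaming job is not looking.
    val out = fs.create(tmp, true) // true = overwrite an existing staging file
    try {
      lines.foreach(l => out.write((l + "\n").getBytes(StandardCharsets.UTF_8)))
    } finally {
      out.close()
    }

    // Make the finished file visible in the watched directory in one step.
    if (!fs.rename(tmp, dst)) {
      throw new IOException(s"Could not rename $tmp to $dst")
    }
    dst
  }
}
```

Since the rename happens only after the output stream is closed, the streaming
job should see the file either complete or not at all, which is what the
"move" requirement quoted below is meant to guarantee.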

On Tue, Jan 26, 2016 at 5:30 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com
> wrote:

> Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or
> write into it directly? `textFileStream` requires that files be written
> to the monitored directory by "moving" them from another location within
> the same file system.
>
> On Mon, Jan 25, 2016 at 6:30 AM, patcharee <patcharee.thong...@uni.no>
> wrote:
>
>> Hi,
>>
>> My streaming application receives data from a file system and just
>> prints the input count at every 1-second interval, as in the code below:
>>
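>> import org.apache.spark.SparkConf
>> import org.apache.spark.streaming.{Milliseconds, StreamingContext}
>>
>> val sparkConf = new SparkConf()
>> // interval_ms is the batch interval in milliseconds (1000 here)
>> val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms))
>> val lines = ssc.textFileStream(args(0)) // args(0): monitored directory
>> lines.count().print()                   // prints the record count per batch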
>>
>> The problem is that sometimes the data received from ssc.textFileStream
>> is ONLY ONE line, although the new file found in that interval actually
>> contains multiple lines. See the log below, which shows three intervals.
>> In the 2nd interval, the new file is
>> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt. This file
>> contains 6288 lines, but ssc.textFileStream returns ONLY ONE line (the
>> header).
>>
>> Any ideas or suggestions as to what the problem is?
>>
>>
>> -----------------------------------------------------------------------------------------
>> SPARK LOG
>>
>> -----------------------------------------------------------------------------------------
>>
>> 16/01/25 15:11:11 INFO FileInputDStream: Cleared 1 old files that were
>> older than 1453731011000 ms: 1453731010000 ms
>> 16/01/25 15:11:11 INFO FileInputDStream: Cleared 0 old files that were
>> older than 1453731011000 ms:
>> 16/01/25 15:11:12 INFO FileInputDStream: Finding new files took 4 ms
>> 16/01/25 15:11:12 INFO FileInputDStream: New files at time 1453731072000
>> ms:
>> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19616.txt
>> -------------------------------------------
>> Time: 1453731072000 ms
>> -------------------------------------------
>> 6288
>>
>> 16/01/25 15:11:12 INFO FileInputDStream: Cleared 1 old files that were
>> older than 1453731012000 ms: 1453731011000 ms
>> 16/01/25 15:11:12 INFO FileInputDStream: Cleared 0 old files that were
>> older than 1453731012000 ms:
>> 16/01/25 15:11:13 INFO FileInputDStream: Finding new files took 4 ms
>> 16/01/25 15:11:13 INFO FileInputDStream: New files at time 1453731073000
>> ms:
>> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt
>> -------------------------------------------
>> Time: 1453731073000 ms
>> -------------------------------------------
>> 1
>>
>> 16/01/25 15:11:13 INFO FileInputDStream: Cleared 1 old files that were
>> older than 1453731013000 ms: 1453731012000 ms
>> 16/01/25 15:11:13 INFO FileInputDStream: Cleared 0 old files that were
>> older than 1453731013000 ms:
>> 16/01/25 15:11:14 INFO FileInputDStream: Finding new files took 3 ms
>> 16/01/25 15:11:14 INFO FileInputDStream: New files at time 1453731074000
>> ms:
>> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19618.txt
>> -------------------------------------------
>> Time: 1453731074000 ms
>> -------------------------------------------
>> 6288
>>
>>
>> Thanks,
>> Patcharee
>>
>
>
