Hi all,

I apologize for reposting, but can anyone explain this behavior, and what would be the best way to resolve it without introducing a message-passing service like Kafka or Redis in between? I have a Logstash instance and would like to stream the output of Logstash directly to Spark Streaming.
We will probably use Kafka eventually, but for now I need guaranteed delivery. With the `tail -f <logfile> | nc -lk 9999` command, I wait a significant time after Spark stops receiving any data in its micro-batches. I confirm that it is no longer getting data, i.e. the end of the file has probably been reached, by printing the first two lines of every micro-batch.

Thanks
Nipun

On Mon, Feb 8, 2016 at 10:05 PM Nipun Arora <nipunarora2...@gmail.com> wrote:
> I have a Spark Streaming service, where I process and detect anomalies
> based on an offline-generated model. I feed data into this service from a
> log file, which is streamed using the following command:
>
> tail -f <logfile> | nc -lk 9999
>
> Here the Spark Streaming service takes data from port 9999. Once Spark
> has finished processing, and is showing that it is processing empty
> micro-batches, I kill both Spark and the netcat process above. However, I
> observe that in some cases the last few lines are dropped, i.e. Spark
> Streaming does not receive those log lines, or they are not processed.
>
> However, I also observed that if I simply redirect the log file as
> standard input instead of tailing it, the connection is closed at the end
> of the file, and no lines are dropped:
>
> nc -q 10 -lk 9999 < logfile
>
> Can anyone explain why this behavior happens? And what would be a better
> way to stream log data to a Spark Streaming instance?
>
> Thanks
>
> Nipun
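For what it's worth, the difference between the two commands likely comes down to EOF: `nc -q 10 -lk 9999 < logfile` closes the connection when the file is exhausted, so the receiver can drain every buffered line before the socket goes away, whereas `tail -f` holds the stream open forever and killing the processes can discard data still sitting in the pipe or socket buffers. Below is a minimal sketch of that mechanism, assuming plain Python sockets as a stand-in for Spark's socketTextStream receiver (this is an illustration, not Spark code; the line contents and port handling are made up):

```python
import socket
import threading

lines = ["log line %d" % i for i in range(100)]

# Set up the listening socket before the reader connects (avoids a race).
srv = socket.socket()
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))   # ephemeral port; 9999 in the original commands
port = srv.getsockname()[1]
srv.listen(1)

def sender():
    # Plays the role of `nc -q 10 -lk 9999 < logfile`: send everything,
    # then close the connection so the reader sees EOF.
    conn, _ = srv.accept()
    conn.sendall("".join(l + "\n" for l in lines).encode())
    conn.close()

t = threading.Thread(target=sender)
t.start()

# Plays the role of the streaming receiver: read lines until EOF.
sock = socket.socket()
sock.connect(("127.0.0.1", port))
received = [line.rstrip("\n") for line in sock.makefile("r")]
sock.close()
t.join()
srv.close()

print(len(received))
```

With a clean close, the receiver's line iteration terminates at EOF and every line is accounted for. With a `tail -f` sender the close never happens, so the receiver has no signal that the stream ended, and killing both ends at "looks idle" time is what loses the tail of the data.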