I have a Spark Streaming service in which I process data and detect anomalies based on an offline-generated model. I feed data into this service from a log file, which is streamed using the following command:
    tail -f <logfile> | nc -lk 9999

Here the Spark Streaming service reads data from port 9999 (a sketch of the receiving side is at the end of this post). Once Spark has finished processing and is only showing empty micro-batches, I kill both Spark and the netcat process above. However, I observe that in some cases the last few lines are dropped, i.e. Spark Streaming either never receives those log lines or does not process them.

I also observed that if I simply redirect the log file to netcat's standard input instead of tailing it, the connection is closed at the end of the file and no lines are dropped:

    nc -q 10 -lk 9999 < logfile

Can anyone explain why this behavior occurs? And what would be a better way to stream log data into a Spark Streaming instance?

Thanks,
Nipun
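For reference, the consuming side is essentially the standard Spark Streaming socket receiver. A minimal sketch is below; the app name, master, batch interval, and the println stand-in for the model-scoring step are illustrative, not my exact code:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LogStreamReceiver {
      def main(args: Array[String]): Unit = {
        // App name, master, and batch interval are illustrative placeholders.
        val conf = new SparkConf().setAppName("AnomalyDetector").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))

        // Read text lines from the netcat listener on port 9999.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Stand-in for the actual model-scoring step.
        lines.foreachRDD(rdd => rdd.foreach(println))

        ssc.start()
        ssc.awaitTermination()
      }
    }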