I have a Spark Streaming service that processes data and detects
anomalies based on an offline-generated model. I feed data into this
service from a log file, which is streamed using the following command:

tail -f <logfile> | nc -lk 9999

Here the Spark Streaming service reads its input from port 9999. Once
Spark has finished processing and is showing empty micro-batches, I kill
both Spark and the netcat process above. However, I observe that in some
cases the last few lines are dropped, i.e. Spark Streaming either never
receives those log lines or never processes them.
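
For reference, the receiving side is essentially the standard Spark
Streaming socket receiver. A minimal sketch of the shape of the job (in
Scala; the app name, batch interval, and scoring step here are
placeholders, not the real ones):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LogAnomalyStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogAnomalyDetection")
    // 1-second micro-batches; the actual batch interval differs
    val ssc = new StreamingContext(conf, Seconds(1))

    // Read log lines from the netcat listener on port 9999
    val lines = ssc.socketTextStream("localhost", 9999)

    // Placeholder for scoring each batch against the offline model
    lines.foreachRDD { rdd =>
      rdd.foreach(line => println(s"scoring: $line"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}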

However, I also observed that if I instead feed the logfile to netcat on
standard input, rather than tailing it, the connection is closed at the
end of the file and no lines are dropped:

nc -q 10 -lk 9999 < logfile
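
My understanding is that -q 10 tells netcat to wait 10 seconds after
hitting EOF on its input before closing the connection, which gives the
receiver time to drain. If that is correct, would combining the two,
i.e. killing tail first and letting netcat linger, along the lines of

tail -f <logfile> | nc -q 10 -lk 9999

be a reliable fix? (I have not tested this.)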

Can anyone explain why this happens? And what would be a more reliable
way of streaming log data to a Spark Streaming instance?


Thanks

Nipun
