Hi All,

I apologize for reposting, but I wonder if anyone can explain this behavior,
and what the best way would be to resolve it. I basically have a Logstash
instance and would like to stream its output to Spark Streaming without
introducing a new message-passing service like Kafka or Redis in between.

We will probably use Kafka eventually, but for now I need guaranteed
delivery.
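
In the meantime, one stopgap I am experimenting with is replacing nc with a
small sender that follows the file and, once it has gone quiet, shuts the
socket down cleanly, so the receiver sees a proper EOF instead of a killed
connection (my guess is that tail -f never reaching end-of-file is why nc
never closes on its own). This is only a rough, untested sketch; the file
path, port, and 10-second idle timeout are placeholders:

import socket
import sys
import time

LOGFILE = sys.argv[1]           # path to the log file (placeholder)
PORT = 9999                     # same port the Spark receiver connects to

# Listen for a single connection from the Spark socket receiver.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("", PORT))
srv.listen(1)
conn, _ = srv.accept()

with open(LOGFILE) as f:
    idle = 0
    while idle < 10:            # stop after ~10s with no new data
        line = f.readline()
        if line:
            conn.sendall(line.encode())
            idle = 0
        else:
            time.sleep(1)       # tail -f style polling
            idle += 1

conn.shutdown(socket.SHUT_WR)   # send EOF so the receiver can drain
conn.close()
srv.close()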

With the tail -f <logfile> | nc -lk 9999 command, I wait for a significant
time after Spark stops receiving any data in its micro-batches. I confirm
that it is no longer getting data, i.e. that the end of the file has
probably been reached, by printing the first two lines of every micro-batch.
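
For reference, the check itself looks roughly like the following PySpark
sketch (the host, port, and 1-second batch interval are just what I use
locally):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="NetcatTailCheck")
ssc = StreamingContext(sc, 1)           # 1-second micro-batches

# Connect to the netcat listener and read lines.
lines = ssc.socketTextStream("localhost", 9999)

# Print the first two lines of every micro-batch; an empty batch
# prints nothing, which is how I spot that no data is arriving.
def show_head(rdd):
    for line in rdd.take(2):
        print(line)

lines.foreachRDD(show_head)

ssc.start()
ssc.awaitTermination()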

Thanks
Nipun



On Mon, Feb 8, 2016 at 10:05 PM Nipun Arora <nipunarora2...@gmail.com>
wrote:

> I have a Spark Streaming service where I process the data and detect
> anomalies on the basis of an offline-generated model. I feed this service
> from a log file, which is streamed using the following command:
>
> tail -f <logfile> | nc -lk 9999
>
> Here the Spark Streaming service takes data from port 9999. Once Spark
> has finished processing and is showing empty micro-batches, I kill both
> Spark and the netcat process above. However, I observe that in some cases
> the last few lines are dropped, i.e. Spark Streaming either never receives
> those log lines or does not process them.
>
> However, I also observed that if I simply redirect the logfile to standard
> input instead of tailing it, the connection is closed at the end of the
> file and no lines are dropped:
>
> nc -q 10 -lk 9999 < logfile
>
> (With netcat-traditional, -q 10 means: wait 10 seconds after EOF on stdin
> and then quit, which presumably gives the receiver time to drain.)
>
> Can anyone explain why this happens? And what would be a better way to
> stream log data to a Spark Streaming instance?
>
>
> Thanks
>
> Nipun
>
